1 Introduction

In this workshop, we will introduce you to spatial data analysis and visualization using R Studio. We will do so through a substantive exploration of systemic racism in traffic police stops. In particular, we will take a large (over 3,000,000 observations) dataset of Colorado traffic stops released by the Stanford Open Policing Project, and develop a simple county-level measure of the magnitude of anti-Black racial bias in traffic police stops during the year 2010. We will then display county-level variation in this bias index on a map of Colorado counties.

1.1 Workshop Objectives

The workshop has several learning objectives. Among other things, you will learn how to:

  • Read tabular and spatial datasets into R Studio
  • Clean and process data in R Studio using tidyverse functions (i.e. subset data, reshape data, summarize datasets, join different datasets together, and define new variables)
  • Make static maps and interactive web maps in R Studio using the tmap package
  • Export your maps from R Studio so they can be embedded in your reports and websites

Most broadly, and most importantly, you will gain an appreciation for how to use real-world social scientific geospatial data to better understand important social problems, and use this understanding to reflect on possible ways to address these problems.

1.2 Data

The workshop makes use of the following three datasets:

  • Colorado State Patrol traffic stops data, available through the Stanford Open Policing Project (mentioned above). The data was collected by Pierson et al. (2020). The data is available at the Project’s data page, under the “CO” heading; if you’re downloading the data directly from the website, be sure to use the “State Patrol” data.
  • A dataset of county-level demographic data (based on the 2010 decennial census), which your instructors generated prior to the workshop. If you’re interested, please consult the Appendix to this tutorial to see how this demographic data was extracted from the census.
  • A spatial dataset of Colorado counties, available from the US census via the Colorado GeoLibrary, a curated online archive of Colorado-related spatial data. Note that this spatial dataset is stored as a shapefile, which is a common file format used to store spatial data.

For convenience, all of this data can be downloaded from this Dropbox Folder.

1.3 Install R and R Studio

In order to follow along with the lesson, you must install both R and R Studio. Please see the installation instructions from this Data Carpentry lesson for additional guidance regarding R and R Studio installation.

1.4 The R Studio Interface

If you haven’t used R Studio before, you can take a quick tour of the R Studio interface and get oriented by consulting this Data Carpentry tutorial titled Introduction to R and R Studio. See, in particular, the section of the tutorial titled “Introduction to R Studio.”

1.5 Install and load packages

In order to carry out the analysis presented in the workshop, you must install the tidyverse (a suite of data science-related R packages), sf (a package used to load and process spatial datasets in R), and tmap (a package that facilitates mapping and spatial data visualization in R). If you haven’t already installed one or more of these packages, you can do so by placing the package name in quotes as an argument to the install.packages() function.

For example, if you wanted to install tmap, you would type the following into your R Studio console, or run it from a script (which you can open by clicking File in your R Studio menu bar, selecting New File, and then clicking R Script):

install.packages("tmap")
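If you’d like to install all three packages in one step, install.packages() also accepts a character vector of package names:

```r
# Installs tidyverse, sf, and tmap in a single call
install.packages(c("tidyverse", "sf", "tmap"))
```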

After installing the required packages, you must load them into memory, which you can do with the following:

# Loads libraries
library(tidyverse)
library(tmap)
library(sf)

Note that you only need to install packages once on a given computer, and (usually) do not need to reinstall packages after quitting an R session and opening up a new one. However, you do need to reload any necessary libraries each time you open a new R session.
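If you’re unsure whether a package is already installed on a given computer, one common pattern (shown here with tmap) is to install it only when it can’t be found:

```r
# Installs tmap only if it isn't already available on this computer
if (!requireNamespace("tmap", quietly = TRUE)) {
  install.packages("tmap")
}
```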

If you would like additional information about how R packages work, please see the Installing Packages section in this Data Carpentry tutorial.

1.6 Set working directory

Before we can bring our data into R Studio and begin the tutorial, we have to specify the location of the relevant data on our computer. This step is known as setting one’s working directory. Before setting the working directory, make sure that you’ve downloaded the data provided at the folder mentioned above to a directory on your computer.

If you’re unfamiliar with the concept of file paths, the easiest way to set your working directory is through the R Studio menu. To do so, follow these steps:

  • First, click on the Session menu on the R Studio menu bar at the top of your screen, and then scroll down to the Set Working Directory button in the menu that opens up.
  • When you hover over the Set Working Directory button, a subsidiary menu that contains a button that says Choose Directory will open; click this Choose Directory button.
  • In the dialog box that opens up, navigate to the directory that contains the downloaded workshop data, select it, and click “Open”. At this point, your working directory should be set!

Alternatively, if you are familiar with the concept of file paths, and know the file path to the folder containing the downloaded datasets, you can set the working directory using the setwd() function, where the argument to the function is the relevant file path enclosed in quotation marks. For example:

# Setting the working directory programmatically
setwd("<FILE PATH TO DATA DIRECTORY HERE>")
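Either way, you can confirm that the working directory has been set correctly with getwd(), and check that the downloaded workshop files are visible with list.files():

```r
# Prints the current working directory
getwd()

# Lists the files in the working directory; the downloaded 
# workshop files should appear here
list.files()
```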

At this point, we’re ready to begin the main part of our lesson!

2 Loading data

Let’s begin by reading in the dataset on State Patrol traffic stops released by the Stanford Open Policing Project. The data can be downloaded from the Stanford Open Policing Project data page, but has also been made available to you as part of the workshop materials. This data is stored as a CSV file.

Once the traffic patrol data has been downloaded into your working directory, pass the name of the file (along with its extension) to the read_csv() function, and assign it to an object. Here, we’ll assign the traffic patrol data to an object named co_traffic_stops. Note that the name of an object is arbitrary, but ideally, it should meaningfully describe the data that has been assigned to it.

# Read in Stanford police data for Colorado and assign to object named 
# "co_traffic_stops"
co_traffic_stops<-read_csv("co_statewide_2020_04_01.csv")

── Column specification ────────────────────────────────────────────────────────
cols(
  .default = col_character(),
  date = col_date(format = ""),
  time = col_logical(),
  subject_age = col_double(),
  arrest_made = col_logical(),
  citation_issued = col_logical(),
  warning_issued = col_logical(),
  contraband_found = col_logical(),
  search_conducted = col_logical()
)
ℹ Use `spec()` for the full column specifications.

Once the traffic patrol data has been read into R Studio and assigned to an object, we can print the contents of the dataset to the console by typing the name of that object into the R Studio console, or by running the name of the object from a script (note that only the first few records will be printed to the console):

# Print the contents of "co_traffic_stops" (i.e. the CO traffic patrol data) 
# to the console; the first few records of the dataset will print
co_traffic_stops
# A tibble: 3,112,853 × 20
   raw_row_number date       time  location county_name subject_age subject_race
   <chr>          <date>     <lgl> <chr>    <chr>             <dbl> <chr>       
 1 1947986|19479… 2013-06-19 NA    19, I70… Mesa County          26 hispanic    
 2 1537576        2012-08-24 NA    254, H2… Jefferson …          NA <NA>        
 3 1581594        2012-09-23 NA    115, I7… Logan Coun…          52 white       
 4 1009205        2011-08-25 NA    197, H8… Douglas Co…          32 white       
 5 1932619        2013-06-08 NA    107, H2… Kiowa Coun…          33 hispanic    
 6 1179436        2011-12-23 NA    48, 384… Boulder Co…          NA <NA>        
 7 1326795        2012-04-07 NA    0, R250… Boulder Co…          39 white       
 8 1786795        2013-03-03 NA    19, E47… Arapahoe C…          44 white       
 9 1552164        2012-09-02 NA    224, H2… Park County          NA <NA>        
10 1004281|10042… 2011-08-21 NA    R2000, … Adams Coun…          32 hispanic    
# … with 3,112,843 more rows, and 13 more variables: subject_sex <chr>,
#   officer_id_hash <chr>, officer_sex <chr>, type <chr>, violation <chr>,
#   arrest_made <lgl>, citation_issued <lgl>, warning_issued <lgl>,
#   outcome <chr>, contraband_found <lgl>, search_conducted <lgl>,
#   search_basis <chr>, raw_Ethnicity <chr>

We can also view the co_traffic_stops object (or, for that matter, any dataset in R Studio) within the R Studio data viewer by passing the name of the relevant object to the View() function:

# Inspect co_traffic_stops in the R Studio data viewer
View(co_traffic_stops)

3 Cleaning and filtering data

To make things tractable (after all, the entire dataset has more than 3,000,000 observations!), let’s focus our attention on Colorado traffic patrol from the year 2010; this also has the advantage of allowing us to eventually use 2010 census data in crafting our measure of racial bias in traffic stops. Note that currently, there isn’t a separate field which contains the year in which a particular stop took place; rather, there is a “date” field, in which the date is stored in YYYY-MM-DD format. The easiest way to extract observations from the year 2010 is therefore to first extract the YYYY information from the “date” field, and use that information to make a new “Year” field that only contains the year of a given stop; we can then extract the 2010 observations based on this newly generated “Year” field.

3.1 Create a “Year” field

To create this new “Year” field within co_traffic_stops, we can use the mutate() function, which is a tidyverse function that allows us to define new variables within a dataset, and the substr() function, which allows us to extract a subset of a given string.

Below, we take the data in the co_traffic_stops object, and then create a new column named “Year” using the mutate() function; we set this column equal to the first four characters of the existing “date” column by passing the expression that reads date, 1, 4 to the substr() function (within mutate(), columns can be referenced directly by name). We then use the assignment operator (<-) to assign this change back to the co_traffic_stops object, which permanently updates the dataset with the addition of the new “Year” field:

# Creates "Year" field, that contains the year of a given stop, 
# in "co_traffic_stops"
co_traffic_stops<-co_traffic_stops %>% 
                    mutate(Year=substr(date, 1, 4))
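To build some intuition for substr(), note that it extracts the characters between a start and a stop position; applied to a date written in YYYY-MM-DD format, positions 1 through 4 pick out the year:

```r
# Extracts characters 1 through 4 of the string (i.e. the year)
substr("2010-04-17", 1, 4)   # returns "2010"
```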

You might have noticed a mysterious symbol in the above code that comes immediately after co_traffic_stops and immediately before calling the mutate() function in the next line. This symbol is known as a “pipe” (%>%). The pipe operator effectively takes the contents to its left, and then uses these contents as an input to the code on its right. Here, the pipe takes the contents of co_traffic_stops object on its left, and then feeds this data into the mutate() function on its right. In other words, the pipe operator links the code on its left to the code on its right, and establishes that the data which is to be modified using the mutate() function is the data assigned to co_traffic_stops. We will use the pipe operator throughout the lesson to chain together functions in this manner.
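Concretely, the two lines below are equivalent; the pipe simply supplies the object on its left as the first argument of the function on its right:

```r
# Both lines create the same "Year" column; the pipe passes 
# co_traffic_stops in as mutate()'s first argument
co_traffic_stops %>% mutate(Year = substr(date, 1, 4))
mutate(co_traffic_stops, Year = substr(date, 1, 4))
```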

Let’s check the updated co_traffic_stops object and make sure that the new field has been successfully created:

# prints contents of "co_traffic_stops"
co_traffic_stops
# A tibble: 3,112,853 × 21
   county_name      date       Year  raw_row_number   time  location subject_age
   <chr>            <date>     <chr> <chr>            <lgl> <chr>          <dbl>
 1 Mesa County      2013-06-19 2013  1947986|1947987  NA    19, I70…          26
 2 Jefferson County 2012-08-24 2012  1537576          NA    254, H2…          NA
 3 Logan County     2012-09-23 2012  1581594          NA    115, I7…          52
 4 Douglas County   2011-08-25 2011  1009205          NA    197, H8…          32
 5 Kiowa County     2013-06-08 2013  1932619          NA    107, H2…          33
 6 Boulder County   2011-12-23 2011  1179436          NA    48, 384…          NA
 7 Boulder County   2012-04-07 2012  1326795          NA    0, R250…          39
 8 Arapahoe County  2013-03-03 2013  1786795          NA    19, E47…          44
 9 Park County      2012-09-02 2012  1552164          NA    224, H2…          NA
10 Adams County     2011-08-21 2011  1004281|1004282… NA    R2000, …          32
# … with 3,112,843 more rows, and 14 more variables: subject_race <chr>,
#   subject_sex <chr>, officer_id_hash <chr>, officer_sex <chr>, type <chr>,
#   violation <chr>, arrest_made <lgl>, citation_issued <lgl>,
#   warning_issued <lgl>, outcome <chr>, contraband_found <lgl>,
#   search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>

Note the newly created Year field above. We can also check to make sure that the “Year” field has been successfully created by viewing co_traffic_stops in the R Studio data viewer with View(co_traffic_stops).

3.2 Filter by year

Now that we have created the “Year” field, we can use it to extract the 2010 observations using the filter() function, and assign the filtered dataset of 2010 stops to a new object named co_traffic_stops_2010:

# Extract 2010 observations and assign to a new object named 
# "co_traffic_stops_2010"
co_traffic_stops_2010<-co_traffic_stops %>% filter(Year==2010)
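One small caveat: because the “Year” column was created with substr(), it is stored as character data. R coerces the number 2010 to the string “2010” before comparing, so the filter above works, but comparing against the string directly makes the intent explicit:

```r
# Equivalent, slightly more type-safe version of the filter: "Year" 
# is a character column, so we compare against the string "2010"
co_traffic_stops_2010<-co_traffic_stops %>% filter(Year=="2010")
```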

When we print the contents of the newly created co_traffic_stops_2010 object, note that the observations are now only from 2010.

# Print contents of "co_traffic_stops_2010" object
co_traffic_stops_2010
# A tibble: 470,284 × 21
   date       Year  county_name      subject_race raw_row_number  time  location
   <date>     <chr> <chr>            <chr>        <chr>           <lgl> <chr>   
 1 2010-04-17 2010  Montezuma County white        188721|188722   NA    2, 989,…
 2 2010-04-17 2010  Montezuma County white        187958          NA    991, 32 
 3 2010-04-17 2010  Montezuma County hispanic     188451          NA    9, 280,…
 4 2010-04-17 2010  Montezuma County white        186989|186990|… NA    3, 277,…
 5 2010-04-17 2010  Montezuma County white        186997|186998|… NA    3, 277,…
 6 2010-04-17 2010  Montezuma County white        186993|186994|… NA    3, 277,…
 7 2010-12-21 2010  Mineral County   <NA>         600865          NA    164.5, …
 8 2010-12-21 2010  Mineral County   <NA>         600477          NA    163, 29…
 9 2010-01-20 2010  Pueblo County    hispanic     36625|36626     NA    312, H5…
10 2010-01-01 2010  Chaffee County   white        275             NA    127, H2…
# … with 470,274 more rows, and 14 more variables: subject_age <dbl>,
#   subject_sex <chr>, officer_id_hash <chr>, officer_sex <chr>, type <chr>,
#   violation <chr>, arrest_made <lgl>, citation_issued <lgl>,
#   warning_issued <lgl>, outcome <chr>, contraband_found <lgl>,
#   search_conducted <lgl>, search_basis <chr>, raw_Ethnicity <chr>

4 Transforming Data

Now that we have created a new dataset that contains information on traffic stops from our year of interest (2010), let’s do a bit of work with the data so that we can find out the total number of traffic stops within each county, and the number of stops within each county that involved a Black motorist.

4.1 Tabulate county-level count of traffic stops by race

First, let’s find out the racial breakdown of stops for each county, using the data in the “subject_race” field of co_traffic_stops_2010.

Below, we take the co_traffic_stops_2010 object, declare “county_name” as a grouping variable using group_by(county_name), and then count up the number of stops associated with each racial category within each county using count(subject_race). The dataset that results from these operations is assigned to a new object named co_county_summary:

# Compute county-level count of traffic stops by race
co_county_summary<-co_traffic_stops_2010 %>% 
                    group_by(county_name) %>% 
                    count(subject_race) 
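As an aside, count() can also take several variables at once; the one-liner below produces the same county-by-race tallies (though, unlike the version above, the result is not left grouped by county_name):

```r
# Tallies stops for each county/race combination in a single call
co_traffic_stops_2010 %>% count(county_name, subject_race)
```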

Let’s print the first few rows from the dataset and observe its structure:

# Prints contents of "co_county_summary"
co_county_summary
# A tibble: 439 × 3
# Groups:   county_name [65]
   county_name    subject_race               n
   <chr>          <chr>                  <int>
 1 Adams County   asian/pacific islander   582
 2 Adams County   black                   1208
 3 Adams County   hispanic                8012
 4 Adams County   other                     36
 5 Adams County   unknown                  462
 6 Adams County   white                  20225
 7 Adams County   <NA>                    3825
 8 Alamosa County asian/pacific islander    18
 9 Alamosa County black                     43
10 Alamosa County hispanic                1537
# … with 429 more rows

Note that each row contains information on the number of times a person from a given racial category was stopped by the traffic police in a given county. For example, in Adams County, CO, Asian/Pacific Islander motorists were stopped 582 times, Black motorists were stopped 1,208 times, and white motorists were stopped 20,225 times. The dataset contains information on the racial and ethnic breakdown of police stops for each county in Colorado, in the year 2010.

4.2 Reshape the data

Note that co_county_summary is currently a “long” dataset, in which there are multiple rows associated with each county; each row corresponds to a distinct county/racial category combination, and the n column provides information on the number of stops associated with that county/racial category pairing.

It will be easier to instead work with a “wide” dataset, in which each county is associated with a single row, and each racial category is assigned to its own column. Each cell in this “wide” dataset would correspond to the number of stops associated with a given county (defined by the row) for a given racial category (defined by the column).

To reshape the dataset from its current “long” format into a “wide” format, we can use the pivot_wider() function. The code below takes the current co_county_summary data object (in “long” format), and then transforms it into a “wide” format with the expression that reads pivot_wider(names_from=subject_race, values_from=n). The names_from argument specifies the name of the current column which contains the categories which we want to transform into columns (here, subject_race), while the values_from argument specifies the name of the current column which contains the values that will be associated with each county/racial category combination (here, n). Finally, the code below assigns the transformed dataset to a new object named co_county_summary_wide:

# Transforms "co_county_summary" from long format to wide, and assigns 
# the reshaped dataset to a new object named "co_county_summary_wide"
co_county_summary_wide<-co_county_summary %>% 
                        pivot_wider(names_from=subject_race, values_from=n)
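To see how pivot_wider() behaves, consider a small, hypothetical example (toy data, not part of the workshop datasets); note that county/category combinations absent from the long data become NA cells in the wide result:

```r
# Toy long-format data: two "counties", two "race" categories
# (assumes the tidyverse has been loaded, as above)
toy<-tibble(county=c("A", "A", "B"), 
            race=c("x", "y", "x"), 
            n=c(1, 2, 3))

# Reshape to wide: "A" gets x=1 and y=2; "B" gets x=3 and an NA for y
toy %>% pivot_wider(names_from=race, values_from=n)
```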

Let’s print the contents of co_county_summary_wide to ensure that the data has indeed been transformed into a “wide” format:

# prints contents of "co_county_summary_wide"
co_county_summary_wide
# A tibble: 65 × 8
# Groups:   county_name [65]
   county_name       `asian/pacific i…` black hispanic other unknown white  `NA`
   <chr>                          <int> <int>    <int> <int>   <int> <int> <int>
 1 Adams County                     582  1208     8012    36     462 20225  3825
 2 Alamosa County                    18    43     1537     9      30  2427   414
 3 Arapahoe County                  540  1819     1862    12     300 11089  1898
 4 Archuleta County                  17    28      392    71      41  4125   417
 5 Baca County                       11    61      288    NA       6   971   174
 6 Bent County                        8    46      314     1       6  1155   278
 7 Boulder County                   345   192     1050    10     180  9682  1594
 8 Broomfield County                 32    22      104     3      18   690   226
 9 Chaffee County                    43    37      361     9      71  4806  1194
10 Cheyenne County                   10    38      147     3       2   821    85
# … with 55 more rows

Indeed, we can see that each row is now associated with a single county, and each column is now associated with a given racial category.

4.3 Calculate total stops for each county in co_county_summary_wide

Now that we have our dataset in “wide” format, let’s create a new column, named “total_stops”, that contains the total number of traffic stops for each county. We can populate this column by summing the stops across all of the racial categories within each county.

The code below takes the co_county_summary_wide object, and then calls the rowwise() function, which allows us to make calculations across each row in a data frame. It then creates a new column, called “total_stops”, using the now-familiar mutate() function; this “total_stops” column is populated by taking the sum of traffic stops across racial categories for each county. This is accomplished with the expression that reads sum(c_across(where(is.integer)), na.rm=TRUE). This expression can be translated as follows: “for each row in the dataset, calculate the sum across the columns whenever the value in a column is an integer; whenever a cell value is ‘NA’, simply ignore it in the calculation.” We’ll assign the dataset, with the newly added “total_stops” column, back to the same co_county_summary_wide object; this effectively overwrites the contents of the current co_county_summary_wide object (a dataset without the “total_stops” column) with the new dataset (which does have the newly populated “total_stops” column).

# Takes the existing "co_county_summary_wide" dataset, and creates a new column 
# called "total_stops" that sums the values across columns for each row; 
# the revised dataset is assigned back to "co_county_summary_wide", 
# which overwrites the object's previous contents with the revised dataset
co_county_summary_wide<-co_county_summary_wide %>% 
                        rowwise() %>% 
                        mutate(total_stops=sum(c_across(where(is.integer)), 
                                               na.rm=TRUE)) 
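Again, a small hypothetical example (toy data, not part of the workshop datasets) may help clarify what rowwise() and c_across() are doing; here, each row’s integer columns are summed, with NA values ignored:

```r
# Toy data with an NA in one of the integer columns
toy<-tibble(name=c("a", "b"), x=c(1L, NA), y=c(2L, 3L))

# Row 1: 1 + 2 = 3; Row 2: the NA is dropped, so the total is 3
toy %>% rowwise() %>% 
        mutate(total=sum(c_across(where(is.integer)), na.rm=TRUE))
```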

Let’s print the contents of the co_county_summary_wide object and confirm that the new column has been successfully created:

# Prints updated contents of "co_county_summary_wide"
co_county_summary_wide
# A tibble: 65 × 9
# Rowwise:  county_name
   county_name   total_stops `asian/pacific…` black hispanic other unknown white
   <chr>               <int>            <int> <int>    <int> <int>   <int> <int>
 1 Adams County        34350              582  1208     8012    36     462 20225
 2 Alamosa Coun…        4478               18    43     1537     9      30  2427
 3 Arapahoe Cou…       17520              540  1819     1862    12     300 11089
 4 Archuleta Co…        5091               17    28      392    71      41  4125
 5 Baca County          1511               11    61      288    NA       6   971
 6 Bent County          1808                8    46      314     1       6  1155
 7 Boulder Coun…       13053              345   192     1050    10     180  9682
 8 Broomfield C…        1095               32    22      104     3      18   690
 9 Chaffee Coun…        6521               43    37      361     9      71  4806
10 Cheyenne Cou…        1106               10    38      147     3       2   821
# … with 55 more rows, and 1 more variable: `NA` <int>

4.4 Clean co_county_summary_wide and assign to new object

Let’s clean up the co_county_summary_wide dataset a bit more, so as to make it even easier to work with. First, because our interest in this workshop is in exploring the possibility that Black motorists suffer disproportionately high traffic stop rates (relative to their share of the overall adult population), let’s create a new object that only contains the data that is essential for the subsequent analysis: the county’s name (“county_name”), the number of Black motorists that were stopped (“black”), and the total number of stops in the county across all racial categories (“total_stops”).

The code below takes the existing co_county_summary_wide object, and then uses the select() function to select the columns we want to keep. It then assigns this selection to a new object named “co_county_black_stops”:

# Selects the "county_name", "black", and "total_stops" columns 
# from the "co_county_summary_wide" object, and assigns the selection 
# to a new object named "co_county_black_stops"

co_county_black_stops<-co_county_summary_wide %>%
                        select(county_name, black, total_stops) 

Let’s print the contents of the newly created co_county_black_stops object to ensure that these changes have been successfully implemented:

# Prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 65 × 3
# Rowwise:  county_name
   county_name       black total_stops
   <chr>             <int>       <int>
 1 Adams County       1208       34350
 2 Alamosa County       43        4478
 3 Arapahoe County    1819       17520
 4 Archuleta County     28        5091
 5 Baca County          61        1511
 6 Bent County          46        1808
 7 Boulder County      192       13053
 8 Broomfield County    22        1095
 9 Chaffee County       37        6521
10 Cheyenne County      38        1106
# … with 55 more rows

As expected, we now have a dataset that contains information on the total number of traffic stops for each Colorado county in 2010 (“total_stops”), and the number of traffic stops that involved Black motorists (“black”).

The name of the column containing information on the number of Black motorists stopped by the traffic patrol is simply “black”, which was inherited from the initial Stanford Open Policing Project dataset. However, this column name is somewhat vague, and may cause confusion down the road when we work with census demographic data to generate our index of racial bias in traffic stops. Therefore, let’s rename that column to “black_stops” using the rename() function, and assign that change back to co_county_black_stops:

# Takes the existing "co_county_black_stops" object and renames the column named 
# "black" to "black_stops"; assigns the modified dataset back to 
# "co_county_black_stops", which overwrites the existing contents of 
# "co_county_black_stops"
co_county_black_stops<-co_county_black_stops %>%
                        rename(black_stops=black)

Let’s now print the modified co_county_black_stops object:

# Prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 65 × 3
# Rowwise:  county_name
   county_name       black_stops total_stops
   <chr>                   <int>       <int>
 1 Adams County             1208       34350
 2 Alamosa County             43        4478
 3 Arapahoe County          1819       17520
 4 Archuleta County           28        5091
 5 Baca County                61        1511
 6 Bent County                46        1808
 7 Boulder County            192       13053
 8 Broomfield County          22        1095
 9 Chaffee County             37        6521
10 Cheyenne County            38        1106
# … with 55 more rows

Note that the name of the column has successfully been changed, and is now more descriptive.

Finally, if we view co_county_black_stops within the R Studio data viewer (View(co_county_black_stops)), we’ll note a small quirk in the dataset. In particular, while Colorado has 64 counties, the dataset has 65 rows; one of the rows is an extra row, with a missing (NA) value in the “county_name” field, and 20 total traffic stops associated with that missing county. Because we’re interested in a county-level analysis, and these stops are not associated with an actual county, we’ll go ahead and delete that row with the following code, which takes the current co_county_black_stops object, and then uses the filter() function, together with the expression !is.na(county_name), to keep only those rows for which “county_name” is not missing; after effectively deleting the row where “county_name” is NA, the code assigns the modified dataset back to co_county_black_stops:

# Takes the "co_county_black_stops" object and removes the row for which 
# "county_name" is missing (NA); assigns the modified dataset back to 
# "co_county_black_stops", which overwrites the dataset that is 
# currently assigned to that object
co_county_black_stops<-co_county_black_stops %>% 
                        filter(!is.na(county_name))

Note that co_county_black_stops now contains 64 rows:

# Prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 64 × 3
# Rowwise:  county_name
   county_name       black_stops total_stops
   <chr>                   <int>       <int>
 1 Adams County             1208       34350
 2 Alamosa County             43        4478
 3 Arapahoe County          1819       17520
 4 Archuleta County           28        5091
 5 Baca County                61        1511
 6 Bent County                46        1808
 7 Boulder County            192       13053
 8 Broomfield County          22        1095
 9 Chaffee County             37        6521
10 Cheyenne County            38        1106
# … with 54 more rows

5 Defining an index of racial bias in traffic stops

Let’s briefly take stock of where we are. We started with a massive dataset (over 3,000,000 observations) of traffic patrol stops in the state of Colorado over the course of almost a decade. Over several steps, we worked our way to a county-level dataset containing information on the total number of traffic stops, and the number of those stops involving a Black motorist, over the course of the year 2010. That dataset looks something like this:

# Prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 64 × 3
# Rowwise:  county_name
   county_name       black_stops total_stops
   <chr>                   <int>       <int>
 1 Adams County             1208       34350
 2 Alamosa County             43        4478
 3 Arapahoe County          1819       17520
 4 Archuleta County           28        5091
 5 Baca County                61        1511
 6 Bent County                46        1808
 7 Boulder County            192       13053
 8 Broomfield County          22        1095
 9 Chaffee County             37        6521
10 Cheyenne County            38        1106
# … with 54 more rows

Now, let’s think about how to use this information to develop a measure of the extent to which traffic police stops may have been driven by racial bias (whether conscious or unconscious) against Black motorists. We might expect that in a discrimination-free world, the share of traffic stops for a given racial group will reflect that group’s share of the overall adult (over-17) population (we’ll consider the over-17 population as our baseline, since that is the demographic eligible to drive). For example, if the Black percentage of a given county’s adult population is 5%, we could form a simple baseline expectation that in a discrimination-free world where “driving while Black” is not effectively criminalized, the percentage of traffic stops involving Black motorists would be approximately 5%. To the extent that the percentage of traffic stops involving Black motorists substantially exceeded 5%, we would observe a statistical pattern consistent with racial discrimination.

Based on this logic, a simple county-level indicator of anti-Black bias in traffic stops would simply be the difference between the percentage of traffic stops in a given county involving Black motorists, and the percentage of the county’s overall population that is Black:

County-Level Traffic Stop Bias Index = (Percentage of County Traffic Stops Involving Black Motorists) - (Percentage of County’s Adult Population that is Black)

As the value of this difference rises above 0, and we see higher positive values for this simple bias index, we might infer higher degrees of discrimination in traffic stops.
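To make the arithmetic concrete, here is a minimal base-R sketch of the index for a single hypothetical county (all numbers below are made up for illustration):

```r
# Hypothetical county (made-up numbers): 80 of 1,000 stops involved a Black
# motorist, and 500 of the 10,000 adult residents are Black
black_stop_pct <- (80 / 1000) * 100     # Black share of stops: 8%
black_pop_pct  <- (500 / 10000) * 100   # Black share of adult population: 5%

# The bias index is simply the difference between the two percentages
bias_index <- black_stop_pct - black_pop_pct
bias_index
# [1] 3
```

A value of 3 here would mean that Black motorists’ share of stops exceeds their share of the adult population by 3 percentage points.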

Of course, this is a very simple index, and it sets aside many complexities (for example, commuting and driving patterns, among other things). Despite its simplicity, however, the intuition behind the index is often used in the social science literature on the topic (Stelter et al., 2021); to a first approximation, therefore, this index gives us a meaningful way to identify counties in which the behavior of the State Patrol might merit greater public scrutiny, on account of disproportionately high stop rates for the Black population.

Let’s create this index for Colorado counties in the year 2010. Note that we already have the data to compute the first part of the bias index (i.e. the percentage of traffic stops involving Black motorists); this is contained in co_county_black_stops:

# prints contents of "co_county_black_stops"
co_county_black_stops
# A tibble: 64 × 3
# Rowwise:  county_name
   county_name       black_stops total_stops
   <chr>                   <int>       <int>
 1 Adams County             1208       34350
 2 Alamosa County             43        4478
 3 Arapahoe County          1819       17520
 4 Archuleta County           28        5091
 5 Baca County                61        1511
 6 Bent County                46        1808
 7 Boulder County            192       13053
 8 Broomfield County          22        1095
 9 Chaffee County             37        6521
10 Cheyenne County            38        1106
# … with 54 more rows

Let’s therefore turn to calculating the second part of the index: the percentage of each county’s adult population that is Black. We’ll carry out this calculation using demographic data from the 2010 decennial census.

5.1 Read in 2010 census data

First, let’s read in the census dataset that was provided to you at the start of the workshop (for more information on how the data was extracted, please see the tutorial’s Appendix) using the read_csv() function, and assign it to an object named co_counties_census_2010:

# Reads in census data and assigns to object named "co_counties_census_2010"
co_counties_census_2010<-read_csv("co_county_decennial_census.csv")

── Column specification ────────────────────────────────────────────────────────
cols(
  GEOID = col_character(),
  County = col_character(),
  total_pop = col_double(),
  total_black_pop_over17 = col_double(),
  total_pop_over17 = col_double()
)

Let’s print the contents of this census dataset:

# Prints contents of "co_counties_census_2010"
co_counties_census_2010
# A tibble: 64 × 5
   GEOID County          total_pop total_black_pop_over17 total_pop_over17
   <chr> <chr>               <dbl>                  <dbl>            <dbl>
 1 08023 Costilla County      3524                     18             2788
 2 08025 Crowley County       5823                    556             5034
 3 08027 Custer County        4255                     37             3525
 4 08029 Delta County        30952                    139            24101
 5 08031 Denver County      600158                  45338           471392
 6 08035 Douglas County     285465                   2447           198453
 7 08033 Dolores County       2064                      4             1602
 8 08049 Grand County        14843                     43            11825
 9 08039 Elbert County       23086                    122            17232
10 08041 El Paso County     622263                  27280           459587
# … with 54 more rows

Note that the “total_pop” variable contains county-level information on the overall population (i.e. all age groups and all racial categories), the “total_pop_over17” variable contains information on the adult (over-17) population (for all racial groups), and the “total_black_pop_over17” variable contains information on the adult (over-17) Black population for each county. We can use the latter two variables to compute the second part of the bias index we defined above.

5.2 Join census data to co_county_black_stops

Before calculating the second component of the bias index, let’s join the census dataset we just viewed (co_counties_census_2010) to the dataset containing information on county-level police stops (co_county_black_stops); this will allow us to calculate both components of the bias index from the single dataset that results from this join.

The code below uses the full_join() function to join co_counties_census_2010 (the second argument) to co_county_black_stops (the first argument), using the “county_name” column (from co_county_black_stops) and the “County” column (from co_counties_census_2010) as the join fields; it assigns the dataset that results from the join to a new object named co_counties_census_trafficstops:

# Joins "co_counties_census_2010" to "co_county_black_stops" using the
# "county_name" and "County" fields as the join fields; assigns the product 
# of the join to an object named "co_counties_census_trafficstops"

co_counties_census_trafficstops<-
  full_join(co_county_black_stops, co_counties_census_2010,
            by=c("county_name"="County"))

When we open co_counties_census_trafficstops, we should see information from both of the constituent datasets (co_county_black_stops and co_counties_census_2010):

# Prints contents of "co_counties_census_trafficstops"
co_counties_census_trafficstops
# A tibble: 64 × 7
# Rowwise:  county_name
   county_name       black_stops total_stops GEOID total_pop total_black_pop_ov…
   <chr>                   <int>       <int> <chr>     <dbl>               <dbl>
 1 Adams County             1208       34350 08001    441603                9396
 2 Alamosa County             43        4478 08003     15445                 142
 3 Arapahoe County          1819       17520 08005    572003               40558
 4 Archuleta County           28        5091 08007     12084                  19
 5 Baca County                61        1511 08009      3788                  15
 6 Bent County                46        1808 08011      6499                 486
 7 Boulder County            192       13053 08013    294567                1961
 8 Broomfield County          22        1095 08014     55889                 415
 9 Chaffee County             37        6521 08015     17809                 264
10 Cheyenne County            38        1106 08017      1836                   4
# … with 54 more rows, and 1 more variable: total_pop_over17 <dbl>
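One practical reason to prefer full_join() here is that rows from either table that fail to match survive the join (filled with NAs), so mismatched county names are easy to spot afterwards. Here is a small base-R sketch of that behavior, using merge() with all = TRUE (which behaves like full_join()) on made-up rows:

```r
# Toy tables with deliberately mismatched county names (made-up data)
stops  <- data.frame(county_name = c("Adams County", "Baca County"),
                     black_stops = c(1208, 61))
census <- data.frame(County = c("Adams County", "Bent County"),
                     total_pop = c(441603, 6499))

# merge() with all = TRUE keeps unmatched rows from both sides, like full_join()
joined <- merge(stops, census,
                by.x = "county_name", by.y = "County", all = TRUE)
joined
```

Baca County and Bent County each appear with an NA in the columns coming from the table they were missing from; scanning a full join’s output for NAs is therefore a quick sanity check that the join fields matched as expected.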

5.3 Define the variables that will be used in the bias index

Now that we have all the information required to calculate both components of the bias index in one dataset, let’s calculate the constituent parts of the bias index. The code below takes the current co_counties_census_trafficstops object that we created above, and then uses the mutate() function to create two new variables. The first variable, named “black_stop_pct”, is defined as the percentage of traffic stops that involved a Black motorist. The second variable, named “black_pop_pct”, is defined as the percentage of a given county’s adult population that is Black. The dataset that includes these two new variables is assigned back to co_counties_census_trafficstops:

# Takes "co_counties_census_trafficstops" and then creates two new variables; 
# one new variable is named "black_stop_pct" (the Black percentage of 
# traffic stops) and the other is "black_pop_pct" (the Black percentage 
# of the over-17 population); assigns the updated dataset back to
# "co_counties_census_trafficstops"

co_counties_census_trafficstops<-
  co_counties_census_trafficstops %>% 
      mutate(black_stop_pct=((black_stops/total_stops)*100),
             black_pop_pct=((total_black_pop_over17/total_pop_over17)*100))

Having created these two new variables (i.e. the components of the bias index we’ll calculate below) in co_counties_census_trafficstops, let’s view the updated dataset:

# Prints contents of "co_counties_census_trafficstops" object
co_counties_census_trafficstops
# A tibble: 64 × 9
# Rowwise:  county_name
   county_name       black_stop_pct black_pop_pct black_stops total_stops GEOID
   <chr>                      <dbl>         <dbl>       <int>       <int> <chr>
 1 Adams County               3.52          2.98         1208       34350 08001
 2 Alamosa County             0.960         1.22           43        4478 08003
 3 Arapahoe County           10.4           9.55         1819       17520 08005
 4 Archuleta County           0.550         0.196          28        5091 08007
 5 Baca County                4.04          0.504          61        1511 08009
 6 Bent County                2.54          9.00           46        1808 08011
 7 Boulder County             1.47          0.846         192       13053 08013
 8 Broomfield County          2.01          1.01           22        1095 08014
 9 Chaffee County             0.567         1.78           37        6521 08015
10 Cheyenne County            3.44          0.289          38        1106 08017
# … with 54 more rows, and 3 more variables: total_pop <dbl>,
#   total_black_pop_over17 <dbl>, total_pop_over17 <dbl>

5.4 Calculate the bias index

Before using the county-level “black_stop_pct” and “black_pop_pct” variables we created above to generate county-level measures of our bias index, let’s first calculate an aggregated bias index for the state of Colorado as a whole.

The mini-script below uses the information in co_counties_census_trafficstops to calculate:

  • The total number of traffic stops involving Black motorists across all counties (assigned to the colorado_total_black_stops object).
  • The total number of traffic patrol stops across all racial categories (assigned to the colorado_total_stops object).
  • The total adult (over-17) Black population in Colorado (assigned to colorado_adult_blackpopulation).
  • The total adult (over-17) population in Colorado, across all races (assigned to colorado_adult_overall).
  • The percentage of traffic patrol stops involving a Black motorist in Colorado, by dividing colorado_total_black_stops by colorado_total_stops and multiplying by 100 (assigned to black_stop_pct_CO).
  • The Black percentage of Colorado’s overall population, by dividing colorado_adult_blackpopulation by colorado_adult_overall and multiplying by 100 (assigned to black_over17_population_pct_CO).

Finally, it uses black_stop_pct_CO and black_over17_population_pct_CO to generate a state-wide value for the bias index, by subtracting the latter from the former; this value is assigned to the object named CO_bias_index:

# Calculates the total number of Black traffic stops in Colorado and assigns 
# the value to an object named "colorado_total_black_stops"
colorado_total_black_stops<-sum(co_counties_census_trafficstops$black_stops, 
                                na.rm=T)

# Calculates the total number of traffic stops in Colorado and assigns the 
# value to an object named "colorado_total_stops"
colorado_total_stops<-sum(co_counties_census_trafficstops$total_stops, na.rm=T)

# Calculates the total adult (over-17) Black population in CO and assigns the value to 
# an object named "colorado_adult_blackpopulation"
colorado_adult_blackpopulation<-
  sum(co_counties_census_trafficstops$total_black_pop_over17, na.rm=T)

# Calculates the total adult (over-17) population in CO and assigns the value to an 
# object named "colorado_adult_overall"
colorado_adult_overall<-sum(co_counties_census_trafficstops$total_pop_over17, 
                            na.rm=T)

# Calculates the percentage of CO traffic patrol stops involving a Black motorist
# and assigns the value to an object named "black_stop_pct_CO"
black_stop_pct_CO<-(colorado_total_black_stops/colorado_total_stops)*100

# Calculates the percentage of Colorado's 17+ population that is Black and 
# assigns the value to an object named "black_over17_population_pct_CO"
black_over17_population_pct_CO<-(colorado_adult_blackpopulation/
                                   colorado_adult_overall)*100

# Calculates the bias index for the state of Colorado as a whole, by computing 
# the difference between "black_stop_pct_CO" and 
# "black_over17_population_pct_CO"; assigns the result to an object 
# named "CO_bias_index"
CO_bias_index<-(black_stop_pct_CO)-(black_over17_population_pct_CO)
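A quick aside on the na.rm=T arguments used above: by default, sum() returns NA if any element of its input is NA, so explicitly dropping missing values is what keeps these statewide totals computable:

```r
# sum() propagates missing values unless asked to drop them
x <- c(1, 2, NA)
sum(x)                # NA: a single missing value poisons the total
sum(x, na.rm = TRUE)  # 3: missing values are dropped before summing
```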

Now, let’s print the value of CO_bias_index:

# Prints value of "CO_bias_index"
CO_bias_index
[1] -1.112616

Interestingly, the value of the aggregate state-level bias index we’ve calculated is negative; specifically, this result suggests that in 2010, the Black share of traffic stops in Colorado was about 1.11 percentage points less than the Black share of Colorado’s adult population. At first glance, this might suggest that, according to our simple bias indicator, “driving while Black” was not effectively criminalized in Colorado during 2010.

However, it might still be the case that there are specific areas in the state where racial bias in traffic stops is a problem, and focusing on an aggregated state-level measure of the bias index obscures possible micro-level variation in patterns of anti-Black bias with respect to traffic stops. Documenting this micro-level variation is an important task, since it would allow us to identify “problem areas” that might be excessively punitive towards Black drivers. The remainder of the lesson attempts to document this county-level micro-geography of systemic bias in traffic stops (as measured by the index we’ve defined).
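Before moving on, one aside on the code above: the step-by-step statewide calculation can be collapsed into a couple of lines by summing the counts first and only then taking percentages. A base-R sketch with made-up numbers for two hypothetical counties:

```r
# Made-up counts for two hypothetical counties
toy <- data.frame(black_stops  = c(100, 50),
                  total_stops  = c(1000, 1000),
                  black_adults = c(500, 100),
                  adults       = c(10000, 5000))

# Statewide bias index: Black share of all stops minus Black share of all adults
statewide_bias <- with(toy,
  (sum(black_stops)  / sum(total_stops)) * 100 -
  (sum(black_adults) / sum(adults))      * 100)
statewide_bias
# [1] 3.5
```

Here 150 of 2,000 stops (7.5%) involve Black motorists against a 4% Black share of adults (600 of 15,000), giving a statewide index of 3.5.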

We’ll begin by creating a new variable in co_counties_census_trafficstops, which we’ll call “bias_index”, that contains the bias index for each county in the dataset (calculated by subtracting “black_pop_pct” from “black_stop_pct” for each county). The code below creates this new “bias_index” variable, and assigns the dataset with this new variable back to the co_counties_census_trafficstops object:

# Creates new variable in "co_counties_census_trafficstops" named "bias_index" 
# that is defined as the difference between the existing "black_stop_pct" and   
# "black_pop_pct" variables
co_counties_census_trafficstops<-
  co_counties_census_trafficstops %>% 
      mutate(bias_index=black_stop_pct-black_pop_pct)

Note that the county-level “bias_index” variable is now in the dataset assigned to the co_counties_census_trafficstops object:

# prints contents of "co_counties_census_trafficstops"
co_counties_census_trafficstops
# A tibble: 64 × 10
# Rowwise:  county_name
   county_name   bias_index black_stop_pct black_pop_pct black_stops total_stops
   <chr>              <dbl>          <dbl>         <dbl>       <int>       <int>
 1 Adams County       0.538          3.52          2.98         1208       34350
 2 Alamosa Coun…     -0.262          0.960         1.22           43        4478
 3 Arapahoe Cou…      0.832         10.4           9.55         1819       17520
 4 Archuleta Co…      0.354          0.550         0.196          28        5091
 5 Baca County        3.53           4.04          0.504          61        1511
 6 Bent County       -6.45           2.54          9.00           46        1808
 7 Boulder Coun…      0.625          1.47          0.846         192       13053
 8 Broomfield C…      1.00           2.01          1.01           22        1095
 9 Chaffee Coun…     -1.21           0.567         1.78           37        6521
10 Cheyenne Cou…      3.15           3.44          0.289          38        1106
# … with 54 more rows, and 4 more variables: GEOID <chr>, total_pop <dbl>,
#   total_black_pop_over17 <dbl>, total_pop_over17 <dbl>

5.5 Compute summary statistics for the bias index

Now that we have our county-level measure of the bias index, let’s document the spatial distribution of “bias_index” with respect to counties. A simple way to do so would be to compute some basic summary statistics for the “bias_index” variable in co_counties_census_trafficstops, which will give us a sense of how the bias index varies across counties. We can generate these summary statistics using the summary() function.

Below, the argument to the summary() function, which reads co_counties_census_trafficstops$bias_index, specifies that we want summary statistics for the “bias_index” variable that is in co_counties_census_trafficstops:

# Produces summary statistics for "bias_index" variable calculated above
summary(co_counties_census_trafficstops$bias_index)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
-9.8430  0.1978  0.4727  0.5278  0.9832  4.2589       2 
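Note the NA's column in the output above: two counties have a missing bias index. A quick way to see which rows contain missing values is to subset on is.na(); here is a base-R toy sketch of the idea (made-up data; in the workshop dataset, the analogous call would be something like filter(co_counties_census_trafficstops, is.na(bias_index))):

```r
# Toy table with one missing bias index value (made-up data)
toy <- data.frame(county     = c("Alpha", "Beta", "Gamma"),
                  bias_index = c(0.5, NA, -1.2))

# Subset to the rows where "bias_index" is missing
toy[is.na(toy$bias_index), ]
```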

This table of summary statistics shows that the mean bias index across counties during 2010 was 0.53, and the median was 0.47; even though the bias index for the state as a whole was negative, many counties had a positive bias index. In other words, many Colorado counties appear to have been stopping Black motorists at disproportionately high rates relative to their share of the county population. Moreover, there is a fairly large range in the bias index across counties: the county with the largest value for “bias_index” had an index of 4.26 (i.e. the share of Black motorists stopped by the police was 4.26 percentage points higher than the share of Black county residents in the over-17 age demographic), while the county with the smallest value had an index of -9.84 (i.e. the share of Black motorists stopped by police was 9.84 percentage points lower than the share of Black county residents in the 17+ age demographic). These basic results suggest that a few outlier parts of the state, where racial bias (as we are measuring it here) does not appear to be a problem, are driving down the state’s overall bias index, even though in many areas of Colorado, Black motorists are indeed subject to disproportionately aggressive policing.

One way to quickly explore this possibility further might be to create a quick visualization that conveys county-level variation in the bias index. In Section 5.6, we’ll visualize this county-level variation in the bias index on a simple graph.

5.6 Visualize county-level variation in the bias index using ggplot

There are many possible ways to visualize county-level variation in the “bias_index” variable. Here, we’ll make a simple bar graph using ggplot2, a popular visualization package that is part of the tidyverse suite of packages.

Before making this graph, let’s quickly make a new column in co_counties_census_trafficstops, named “County”, that takes the information in the existing “county_name” field and removes the part of the string that reads “County”. This will result in a cleaner-looking graph, since we can more concisely convey that our geographic units are counties in the graph’s main title.

The following code takes the dataset currently assigned to the co_counties_census_trafficstops object, and then uses the mutate() function to create a new field named “County”, which is populated by taking the existing “county_name” column and using the str_remove() function to remove the part of the “county_name” string that reads “County”. It then assigns the modified dataset back to the co_counties_census_trafficstops object, overwriting the previous version of the dataset:

# Creates new variable named "County" that removes "County" string
# from "county_name" variable
co_counties_census_trafficstops<-
  co_counties_census_trafficstops %>% 
     mutate(County=str_remove(county_name, " County"))

Let’s confirm that the new “County” field has been successfully created:

# prints "co_counties_census_trafficstops"
co_counties_census_trafficstops
# A tibble: 64 × 11
# Rowwise:  county_name
   county_name       County  bias_index black_stop_pct black_pop_pct black_stops
   <chr>             <chr>        <dbl>          <dbl>         <dbl>       <int>
 1 Adams County      Adams        0.538          3.52          2.98         1208
 2 Alamosa County    Alamosa     -0.262          0.960         1.22           43
 3 Arapahoe County   Arapah…      0.832         10.4           9.55         1819
 4 Archuleta County  Archul…      0.354          0.550         0.196          28
 5 Baca County       Baca         3.53           4.04          0.504          61
 6 Bent County       Bent        -6.45           2.54          9.00           46
 7 Boulder County    Boulder      0.625          1.47          0.846         192
 8 Broomfield County Broomf…      1.00           2.01          1.01           22
 9 Chaffee County    Chaffee     -1.21           0.567         1.78           37
10 Cheyenne County   Cheyen…      3.15           3.44          0.289          38
# … with 54 more rows, and 5 more variables: total_stops <int>, GEOID <chr>,
#   total_pop <dbl>, total_black_pop_over17 <dbl>, total_pop_over17 <dbl>
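As an aside, base R’s sub() can perform the same removal as stringr’s str_remove(); a quick sketch with made-up values:

```r
# Remove the literal text " County" from each county name (base-R equivalent
# of str_remove(county_name, " County"))
county_name <- c("Adams County", "Baca County", "Bent County")
County <- sub(" County", "", county_name, fixed = TRUE)
County
# [1] "Adams" "Baca"  "Bent"
```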

Now, we’re ready to use ggplot2 to make our bar graph.

The following code takes co_counties_census_trafficstops, and then:

  • Uses the drop_na() function to remove counties for which the “bias_index” value is “NA” (there are two such counties), so that they do not show up on the graph.
  • It then initiates ggplot2 by calling the ggplot() function.
  • It then uses the geom_col() function to indicate that the desired output is a bar graph. Within the geom_col() function, the expression that reads aes(x=County, y=bias_index) specifies which variables we want to represent on the graph, and how we want to represent them (i.e. which variable goes on the x-axis, and which on the y-axis). The instructions used to translate the variables in a tabular dataset into a visual representation are known as an “aesthetic mapping”, which is abbreviated within the grammar of ggplot2 to aes.
  • After specifying that we want a bar graph based on the co_counties_census_trafficstops object (with the “County” column mapped to the x-axis and the “bias_index” column mapped to the y-axis), we use the labs() function (short for “labels”) to designate the graph’s main title and the labels for the x and y axes.
  • In ggplot2, the theme() function is a versatile function that helps customize the appearance of a ggplot2 object; here, we set the axis.title.x argument equal to element_text(size=9), which sets the size of the x-axis title, and we set the plot.title argument equal to element_text(hjust=0.5), which center-justifies the plot’s main title. Next, we use the scale_y_continuous() function to set the interval breaks of the y-axis; these breaks are defined using a vector passed to the breaks argument, where each numeric element denotes a desired interval break.
  • The expand_limits() function allows us to expand the range of either axis beyond the range established by the axis interval markers; here, by specifying y=c(-10,10) as an argument to expand_limits(), we stretch the y-axis range from -10 to 10, even though the highest y-axis tick mark is set at 7.5.
  • Finally, we assign the plot that is created with this code to a new object named bias_graph:
# Uses ggplot2 to create bar graph of "bias_index" variable, with "bias_index" 
# on Y-axis and counties on X-axis
bias_graph<-
  co_counties_census_trafficstops %>% 
  drop_na(bias_index) %>% 
    ggplot()+
      geom_col(aes(x=County, y=bias_index))+
      labs(title="Racial Discrimination in Police Stops, by County (Colorado)", 
           x="County", 
           y="Bias Index=\nBlack Percentage of Overall Traffic Stops – 
           Black Percentage of Overall Over-17 Population")+
       theme(axis.title.x = element_text(size = 9), 
             plot.title=element_text(hjust=0.5))+      
       scale_y_continuous(breaks=c(-10, -7.5, -5, -2.5, 0, 2.5, 5, 7.5))+
       expand_limits(y=c(-10,10))

Let’s print the contents of bias_graph and see what the plot looks like; it will appear in the “Plots” tab on the bottom-right of your R Studio interface.

# Prints contents of "bias_graph"
bias_graph

As we can see, the graph shows that certain counties have bias index values around zero, or well below zero, which drives down the overall value for the state; but in many other counties, the value of the “bias_index” variable is positive and substantively large.

While this graph does convey valuable information, it’s a little bit confusing to look at; for example, the county names are squished together in a way that makes them unreadable.

A quick way to make the plot more readable is to invert the axes (such that counties are on the y-axis, and the bias index is arrayed along the x-axis). The code below is largely the same as the code in the previous code block. The differences are twofold, and together they change the plot’s appearance. First, the aesthetic mapping passed to geom_col() looks a bit different: instead of simply specifying aes(x=County, y=bias_index), we specify aes(x=reorder(County, bias_index), y=bias_index). The reorder() function specifies that we want the counties to be sorted by their “bias_index” values, rather than alphabetically; sorting the bars this way makes for a graph that is easier to read. Second, we call the coord_flip() function, which inverts the axes, so that the x and y aesthetics specified within geom_col() trade places. We’ll assign this new plot to a new object named bias_graph_inverted:

# Uses ggplot2 to create bar graph of "bias_index" variable, with "bias_index" 
# on the x-axis and counties on the y-axis; assigns the result to an object 
# named "bias_graph_inverted"
bias_graph_inverted<-
  co_counties_census_trafficstops %>% 
    drop_na(bias_index) %>% 
      ggplot()+
        geom_col(aes(x=reorder(County, bias_index), y=bias_index))+
        coord_flip()+
       labs(
         title="Racial Discrimination in Police Stops, by County (Colorado)", 
         x="County", 
        y="Bias Index=\nBlack Percentage of Overall Traffic Stops-
        Black Percentage of Overall Over-17 Population")+
        theme(axis.title.x = element_text(size = 9), 
              plot.title=element_text(hjust=0.5))+      
        scale_y_continuous(breaks=c(-10, -7.5, -5, -2.5, 0, 2.5, 5, 7.5))+
        expand_limits(y=c(-10,10))

Let’s view the revised plot:

# prints contents of "bias_graph_inverted"
bias_graph_inverted

We can see that those relatively small changes produced a more readable and informative plot. We can get a clear sense of how the traffic stops bias index varies across Colorado counties, and identify the counties where policing practices appear excessively punitive towards Black drivers. The plot clearly conveys why focusing solely on aggregated statistics might be problematic: such aggregated measures have the potential to obscure more granular patterns of bias across a state’s jurisdictions.

6 Mapping the bias index

However, while the plot above allows us to identify the counties where CO State Patrol policing practices may deserve greater scrutiny, it is difficult to contextualize that information without a clear sense of where these counties are located. For example, we might want to know whether there are geographic clusters of counties with particularly high or low values for the bias index. And to the extent we find such geographic patterns, we can empower communities to address problems of systemic discrimination in an empirically-informed and geographically-aware manner.

In short, counties are spatial units, and creating a visualization where we can explicitly situate our “bias_index” variable in a meaningful geographic context might prove even more informative than a plot such as the one developed in the previous section (which of course does not provide information about where counties are located). In other words, it would be useful to display the bias index on a county-level map of Colorado, which will allow us to get a clearer sense of the granular spatial distribution of the bias index across the state. This section will walk through the process of creating such a map.

6.1 Read in and view the shapefile of CO counties

In order to visualize our bias index on a map of Colorado counties, we need to first load a spatial dataset of Colorado counties into R Studio. A spatial dataset is simply a dataset that has geographic information attached to it; this geographic information can be used by geospatial software to render a map.

Let’s load the spatial dataset of Colorado counties that was provided to you at the start of the workshop. This spatial dataset is stored as a shapefile, a commonly used file format for spatial data. We can load shapefiles into R Studio using the st_read() function from the sf package. Below, we’ll load our shapefile and assign it to a new object named co_counties_shapefile. Note that a shapefile is actually composed of several different files with various extensions; make sure that all of these files are in your working directory. The file name passed to st_read(), however, must have the “.shp” extension, as below:

# Reads in shapefile and assigns to object named "co_counties_shapefile"
co_counties_shapefile<-st_read("tl_2019_08_county.shp")
Reading layer `tl_2019_08_county' from data source `/Users/adra7980/Documents/git_repositories/jmgl_sopp_workshop/co_counties/tl_2019_08_county/tl_2019_08_county.shp' using driver `ESRI Shapefile'
Simple feature collection with 64 features and 17 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -109.0602 ymin: 36.99245 xmax: -102.0415 ymax: 41.00344
geographic CRS: NAD83

Upon reading in the shapefile, you’ll notice that some metadata about the shapefile is printed to the console. We see that the shapefile has a geometry type of “multipolygon” (other possible geometry types are points and lines), that there are 64 rows and 17 columns in the dataset, and that its coordinate reference system is NAD83. Coordinate reference systems are important to consider if you plan to do spatial analysis that involves calculations with spatially defined attributes; since we don’t plan to carry out an analysis of this nature (we’re only interested in using the spatial dataset as a way to visualize data), we can set this concept aside for the purpose of our workshop.

Now, let’s print the contents of co_counties_shapefile:

# Prints contents of "co_counties_shapefile"
co_counties_shapefile
Simple feature collection with 64 features and 17 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -109.0602 ymin: 36.99245 xmax: -102.0415 ymax: 41.00344
geographic CRS: NAD83
First 10 features:
                         geometry STATEFP COUNTYFP COUNTYNS GEOID     NAME
1  MULTIPOLYGON (((-106.8714 3...      08      109 00198170 08109 Saguache
2  MULTIPOLYGON (((-102.6521 4...      08      115 00198173 08115 Sedgwick
3  MULTIPOLYGON (((-102.5769 3...      08      017 00198124 08017 Cheyenne
4  MULTIPOLYGON (((-105.7969 3...      08      027 00198129 08027   Custer
5  MULTIPOLYGON (((-108.2952 3...      08      067 00198148 08067 La Plata
6  MULTIPOLYGON (((-107.9751 3...      08      111 00198171 08111 San Juan
7  MULTIPOLYGON (((-106.9154 3...      08      097 00198164 08097   Pitkin
8  MULTIPOLYGON (((-105.9751 3...      08      093 00198162 08093     Park
9  MULTIPOLYGON (((-106.0393 3...      08      003 00198117 08003  Alamosa
10 MULTIPOLYGON (((-102.2111 3...      08      099 00198165 08099  Prowers
          NAMELSAD LSAD CLASSFP MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT      ALAND
1  Saguache County   06      H1 G4020  <NA>   <NA>     <NA>        A 8206547705
2  Sedgwick County   06      H1 G4020  <NA>   <NA>     <NA>        A 1419419016
3  Cheyenne County   06      H1 G4020  <NA>   <NA>     <NA>        A 4605713960
4    Custer County   06      H1 G4020  <NA>   <NA>     <NA>        A 1913031921
5  La Plata County   06      H1 G4020  <NA>  20420     <NA>        A 4376255148
6  San Juan County   06      H1 G4020  <NA>   <NA>     <NA>        A 1003660672
7    Pitkin County   06      H1 G4020   233  24060     <NA>        A 2514104907
8      Park County   06      H1 G4020   216  19740     <NA>        A 5682182508
9   Alamosa County   06      H1 G4020  <NA>   <NA>     <NA>        A 1871465874
10  Prowers County   06      H1 G4020  <NA>   <NA>     <NA>        A 4243429484
     AWATER    INTPTLAT     INTPTLON
1   4454510 +38.0316514 -106.2346662
2   3530746 +40.8715679 -102.3553579
3   8166129 +38.8356456 -102.6017914
4   3364150 +38.1019955 -105.3735123
5  25642578 +37.2873673 -107.8397178
6   2035929 +37.7810492 -107.6702567
7   6472577 +39.2175376 -106.9161587
8  43519840 +39.1189141 -105.7176479
9   1847610 +37.5684423 -105.7880414
10 15345176 +37.9581814 -102.3921613

Notice that this looks very much like a typical dataset, one that we can also view in R Studio’s data viewer with View(co_counties_shapefile). The key element that makes this dataset different from a conventional dataset is contained in the “geometry” column. For each county, the geometry column contains a series of latitude/longitude pairs corresponding to that county’s borders in the “real world”. These lat/long pairs are processed by GIS software, and used to render a visual representation of a county’s geographic borders that reflects its “real-world” shape and location. When each row is rendered simultaneously, the result is a map of Colorado counties that can be used as a foundation for data visualization and analysis.
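If you'd like to see those latitude/longitude pairs directly, the sf package can extract them from the "geometry" column. A quick sketch, assuming sf is loaded:

```r
library(sf)

# Pulls out just the geometry column (a set of 64 multipolygons)
county_geoms <- st_geometry(co_counties_shapefile)

# Displays the first few longitude/latitude coordinate pairs that trace the
# first county's border
head(st_coordinates(county_geoms[1]))
```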

Within R Studio, we can use functions from the tmap package to visually render a spatial dataset’s geographic attributes (stored in the “geometry” column), and thereby create a map. The code below uses the geometry information in co_counties_shapefile to render the Colorado county polygons as a map. First, we pass the name of the spatial object we’d like to map (co_counties_shapefile) to the tm_shape() function, and then use the tm_polygons() function to instruct the mapping utility that the spatial data in the “geometry” column must be interpreted and rendered as polygons (note that the two tmap functions are connected by a + sign, which is standard tmap syntax):

# Uses "geometry" information in "co_counties_shapefile" to create map of CO 
# counties
tm_shape(co_counties_shapefile)+
  tm_polygons()
## tmap mode set to plotting

You should see a map that looks like this in the “Plots” tab on the bottom-right of your R Studio interface.

Below, we’ll assign this basic map, which was rendered from co_counties_shapefile, to a new object named co_counties_map:

# assigns map of CO polygons to new object named "co_counties_map"
co_counties_map<-tm_shape(co_counties_shapefile)+
                    tm_polygons()

Now, whenever we want to retrieve the map, we can simply print the name of the object to which it has been assigned:

# prints contents of "co_counties_map"
co_counties_map

In using tmap to translate the geographic information in the “geometry” column of our spatial dataset of Colorado counties into a visual representation, we produced a static map; in other words, we cannot pan around the map or zoom in and out. However, if we want to render a dynamic map where such things are possible, we can simply use the tmap_mode() function to change the setting of the tmap environment to “view”, as below:

# Sets tmap mode to "view"
tmap_mode("view")
## tmap mode set to interactive viewing

Now, if we open the co_counties_map object that we created earlier, tmap will render a dynamic and interactive map:

# Prints "co_counties_map" as a web map
co_counties_map

You will be able to view this interactive map in the “Viewer” tab on the bottom right of your R Studio interface. This interactive map is essentially a web map, which we can easily export as an HTML file and embed on a website (if we so choose).
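For instance, a web map can be saved to disk with tmap's tmap_save() function; a short sketch (the filename here is arbitrary):

```r
# Saves the interactive map as a standalone HTML file, which can then be
# embedded in a website; a ".html" extension tells tmap_save() to produce
# a web map
tmap_save(co_counties_map, filename = "co_counties_map.html")
```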

If we want to switch back to making static maps, we can switch back to “plot” mode using the same tmap_mode() function:

# Switches tmap mode back to "plot"
tmap_mode("plot")
## tmap mode set to plotting

Once we’re back in “plot” mode, tmap will render spatial objects as static maps. For instance, if we open co_counties_map again, it will render as a static map:

# prints "co_counties_map" as a static plot
co_counties_map

As we work with spatial data and maps in R Studio using tmap, we can easily toggle back and forth between “view” and “plot” modes, depending on our desired outputs.

6.2 Join co_counties_census_trafficstops to the shapefile of Colorado counties

Now that we’ve loaded and explored our spatial dataset of Colorado counties, and rendered it as a map, let’s turn to the process of displaying the bias index on this county-level map.

In order to display our data of interest on the map of Colorado counties that we generated above, we must first join the data on the bias index to the spatial dataset; once the bias index data is in our spatial dataset, we can use tmap functions to render a map that displays this data on the county polygons.

We can implement a join between a spatial dataset and a tabular dataset in much the same way as we implemented the join between two tabular datasets earlier in the tutorial (between the traffic stops dataset and the census dataset).

Below, the first argument to the full_join() function is the name of our spatial object of Colorado counties; the second argument is the name of the object which contains the “bias_index” data (which we want to merge into the spatial dataset). Finally, the third argument indicates that we want to use the field named “GEOID” (which appears in both datasets) as the join field. We’ll assign the expanded spatial dataset that results from the join to a new object named county_shapefile_biasIndex:

# Joins "co_counties_census_trafficstops" into "co_counties_shapefile" 
# using GEOID as the join field; assigns the new joined dataset to a new 
# object named "county_shapefile_biasIndex"
county_shapefile_biasIndex<-
  full_join(co_counties_shapefile, 
            co_counties_census_trafficstops, by="GEOID")

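After a join like this, it's often worth checking how many counties actually matched. A quick sanity check, assuming dplyr and sf are loaded (counties without a match will show up as "No Data" on our maps):

```r
library(dplyr)
library(sf)

# Counts counties that did not receive a "bias_index" value in the join;
# st_drop_geometry() removes the spatial column so we get a plain summary table
county_shapefile_biasIndex %>%
  st_drop_geometry() %>%
  summarize(missing_bias = sum(is.na(bias_index)))
```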
Now, when we open county_shapefile_biasIndex, we should see the “bias_index” variable in the dataset, along with the “geometry” information needed to render a map of Colorado counties:

# prints contents of "county_shapefile_biasIndex"
county_shapefile_biasIndex
Simple feature collection with 64 features and 27 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -109.0602 ymin: 36.99245 xmax: -102.0415 ymax: 41.00344
geographic CRS: NAD83
First 10 features:
       NAME                       geometry bias_index STATEFP COUNTYFP COUNTYNS
1  Saguache MULTIPOLYGON (((-106.8714 3...  0.5591577      08      109 00198170
2  Sedgwick MULTIPOLYGON (((-102.6521 4...  3.0473002      08      115 00198173
3  Cheyenne MULTIPOLYGON (((-102.5769 3...  3.1472044      08      017 00198124
4    Custer MULTIPOLYGON (((-105.7969 3... -0.2021878      08      027 00198129
5  La Plata MULTIPOLYGON (((-108.2952 3...  0.2272495      08      067 00198148
6  San Juan MULTIPOLYGON (((-107.9751 3...  0.8000000      08      111 00198171
7    Pitkin MULTIPOLYGON (((-106.9154 3...  0.3336883      08      097 00198164
8      Park MULTIPOLYGON (((-105.9751 3...  0.3973331      08      093 00198162
9   Alamosa MULTIPOLYGON (((-106.0393 3... -0.2620964      08      003 00198117
10  Prowers MULTIPOLYGON (((-102.2111 3...  3.2319999      08      099 00198165
   GEOID        NAMELSAD LSAD CLASSFP MTFCC CSAFP CBSAFP METDIVFP FUNCSTAT
1  08109 Saguache County   06      H1 G4020  <NA>   <NA>     <NA>        A
2  08115 Sedgwick County   06      H1 G4020  <NA>   <NA>     <NA>        A
3  08017 Cheyenne County   06      H1 G4020  <NA>   <NA>     <NA>        A
4  08027   Custer County   06      H1 G4020  <NA>   <NA>     <NA>        A
5  08067 La Plata County   06      H1 G4020  <NA>  20420     <NA>        A
6  08111 San Juan County   06      H1 G4020  <NA>   <NA>     <NA>        A
7  08097   Pitkin County   06      H1 G4020   233  24060     <NA>        A
8  08093     Park County   06      H1 G4020   216  19740     <NA>        A
9  08003  Alamosa County   06      H1 G4020  <NA>   <NA>     <NA>        A
10 08099  Prowers County   06      H1 G4020  <NA>   <NA>     <NA>        A
        ALAND   AWATER    INTPTLAT     INTPTLON     county_name   County
1  8206547705  4454510 +38.0316514 -106.2346662 Saguache County Saguache
2  1419419016  3530746 +40.8715679 -102.3553579 Sedgwick County Sedgwick
3  4605713960  8166129 +38.8356456 -102.6017914 Cheyenne County Cheyenne
4  1913031921  3364150 +38.1019955 -105.3735123   Custer County   Custer
5  4376255148 25642578 +37.2873673 -107.8397178 La Plata County La Plata
6  1003660672  2035929 +37.7810492 -107.6702567 San Juan County San Juan
7  2514104907  6472577 +39.2175376 -106.9161587   Pitkin County   Pitkin
8  5682182508 43519840 +39.1189141 -105.7176479     Park County     Park
9  1871465874  1847610 +37.5684423 -105.7880414  Alamosa County  Alamosa
10 4243429484 15345176 +37.9581814 -102.3921613  Prowers County  Prowers
   black_stop_pct black_pop_pct black_stops total_stops total_pop
1       0.7296607     0.1705030          20        2741      6108
2       3.4120735     0.3647733          26         762      2379
3       3.4358047     0.2886003          38        1106      1836
4       0.8474576     1.0496454           1         118      4255
5       0.6191950     0.3919455          70       11305     51334
6       0.8000000     0.0000000           1         125       699
7       0.8213552     0.4876670           4         487     17148
8       0.7943403     0.3970072          64        8057     16206
9       0.9602501     1.2223466          43        4478     15445
10      3.7458295     0.5138297         247        6594     12551
   total_black_pop_over17 total_pop_over17
1                       8             4692
2                       7             1919
3                       4             1386
4                      37             3525
5                     160            40822
6                       0              571
7                      69            14149
8                      52            13098
9                     142            11617
10                     47             9147

6.3 Display the bias index on a map of Colorado counties

At this point, with our “bias_index” variable included in a spatially explicit dataset of Colorado counties, we are ready to visualize this data on a map, and observe how this measure of bias in traffic police stops varies geographically across counties.

6.3.1 Make a rough draft of a map

Let’s start by making the simplest possible map of the “bias_index” variable. Recall that earlier, we drew the Colorado county polygons with the following:

# Uses tmap to render Colorado polygons based on geometry information in 
# "county_shapefile_biasIndex" object
tm_shape(county_shapefile_biasIndex)+
  tm_polygons()

Now, to display actual county-level attribute data from the dataset on the map of Colorado counties, we can simply pass an argument to the tm_polygons() function. In particular, we can specify the column that contains the data we want to represent on the map, using the col argument to the tm_polygons() function. In our case, the name of the column that contains the data we want to display on the map is “bias_index”, so we will specify col="bias_index" within the tm_polygons() function:

# Creates map of "bias_index" variable from "county_shapefile_biasIndex" 
# spatial object
tm_shape(county_shapefile_biasIndex)+
  tm_polygons(col="bias_index")
## Variable(s) "bias_index" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

As you can see, we now have a county-level map of the Colorado “bias_index” data we created earlier. Admittedly, this map is very rough; important elements of the map, such as the legend interval breaks and the color scheme, were chosen arbitrarily by tmap, and these arbitrary settings hinder the ability of the map to effectively convey information about the spatial distribution of “bias_index” across Colorado counties.

However, tmap allows us to customize our maps, and explicitly specify parameters that shape the map’s appearance. Let’s begin exploring some of these customization possibilities. In the code below, we still start out by passing county_shapefile_biasIndex to the tm_shape() function (so as to specify the spatial object we’d like to map), and passing col="bias_index" to the tm_polygons() function (as before).

However, we’ll begin customizing the map by passing additional arguments to the tm_polygons() function:

  • With respect to the color scheme, we’ll designate a yellow-to-red color palette with palette="YlOrRd", and set midpoint=0 so that 0 is the point in the data distribution that corresponds to the midpoint of the palette (in other words, a hypothetical county where “bias_index” is exactly zero would be displayed in the color palette’s median color). For more information on color options in R (including color and palette codes), see this useful R Color cheatsheet.
  • In the next three arguments to tm_polygons(), we’ll begin customizing the map’s legend. Setting textNA="No Data" specifies that the legend should label the category for counties with NA values for the bias index as “No Data”, rather than the default label, which is “Missing”. Setting n=5 establishes that we want to group our data into five distinct intervals, which means our legend will have five entries. Finally, setting style="jenks" specifies that we want to use the Jenks algorithm to decide where to place those legend breaks (or in other words, how to break the data up into five distinct intervals). For more information on the Jenks Natural Breaks Classification, as well as other data classification algorithms, see this useful guide to data classification methods.
  • Finally, we call the tm_layout() function, which allows us to customize the map’s layout; within this function, we set frame=FALSE (which removes the map’s frame, or bounding box) and legend.outside=TRUE (which places the legend outside the map’s domain, so as to not interfere with the display of counties).
# Creates map of "bias_index" from the "county_shapefile_biasIndex" object with 
# yellow-to-red color scheme, five legend partitions based on the Jenks algorithm, 
# and without a map frame/bounding box; counties with NA values are labeled with 
# "No Data" in the legend, and the map's legend is placed outside the polygon display
  tm_shape(county_shapefile_biasIndex)+
      tm_polygons(col="bias_index", 
                  palette="YlOrRd", 
                  midpoint=0,
                  textNA="No Data", 
                  n=5,
                  style="jenks")+
      tm_layout(frame=FALSE, 
               legend.outside=TRUE)

This is starting to look better; in the next two subsections, we’ll explore how to further control the map’s appearance by setting custom breaks and custom color schemes.

6.3.2 Make a map with custom breaks

Let’s say that instead of using the Jenks algorithm (or some other classification algorithm) to partition our data into intervals, we want to set our own data intervals. Doing so might make sense here, especially since we want to clearly differentiate counties with a bias index less than or equal to zero from those above zero. We can specify our legend breaks as a numeric vector (i.e. a sequence of numbers) passed to the breaks argument within the tm_polygons() function. Below, we set breaks=c(-10,-5,0,2,4,5), which indicates that we want our intervals to run from -10 to -5; -5 to 0; 0 to 2; 2 to 4; and 4 to 5. Note that the “c” before the sequence of numbers bounded by parentheses is actually a function, which indicates that the sequence of elements that follows should be interpreted as a vector. Also, note that we’ve removed the style="jenks" argument that we used above, since we’re now using custom legend breaks (rather than breaks computed by the Jenks algorithm). Other than those changes, the code below is the same as the previous code block.

# Creates map of "bias_index" from the "county_shapefile_biasIndex" object with 
# yellow-to-red color scheme, five legend partitions based on a vector passed to 
# the "breaks" argument, and without a map frame/bounding box; counties with NA values 
# are labeled with "No Data" in the legend, and the map's legend is placed outside 
# the polygon display
  tm_shape(county_shapefile_biasIndex)+
    tm_polygons(col="bias_index", 
                palette="YlOrRd", 
                midpoint=0,
                textNA="No Data",
                breaks=c(-10,-5, 0, 2, 4, 5))+
    tm_layout(frame=FALSE, 
              legend.outside=TRUE)

6.3.3 Make a map with custom colors

So far, we’ve been using a predefined color palette (“YlOrRd”) to display the range of values for “bias_index” on our map. While this color scheme might be a good start, it can sometimes be difficult to distinguish the colors on the map. One possible fix is to explore other predefined color schemes, and find one that makes the colors easier to distinguish, given the data distribution for “bias_index”. Another possibility is to specify our own colors for the map’s intervals. To do this, let’s first define a character vector, assigned to an object named my_colors, that contains the colors we want to use (once again, a reminder that color names are available on the R Color Cheatsheet):

# defines vector of colors and assigns vector to an object named "my_colors"
my_colors<-c("white", "peachpuff", "red1", "red4", "navy")

Now, we can pass this vector as an argument to the tm_polygons() function. Instead of setting palette="YlOrRd" (as above), we instead set palette=my_colors. The colors in the my_colors vector are assigned to the numeric intervals in order; that is, the interval from -10 to -5 is assigned the color “white” (the first color in the vector), the interval from -5 to 0 is assigned the color “peachpuff” (the second color in the vector), the interval from 0 to 2 is assigned the color “red1” (the third color in the vector), and so on. Everything else in the code remains the same as in the previous section:

# Creates map of "bias_index" from "county_shapefile_biasIndex" object using custom 
# color palette defined in the "my_colors" vector
tm_shape(county_shapefile_biasIndex)+
  tm_polygons(col="bias_index", 
              palette=my_colors, 
              textNA="No Data",
              breaks=c(-10,-5, 0, 2, 4, 5))+
  tm_layout(frame=FALSE, 
            legend.outside=TRUE)

We can see that these colors are easier to distinguish, which makes it easier to quickly scan the map for relevant patterns.

It is also worth reminding ourselves that it’s possible to assign the maps we create using tmap to objects. For example, let’s assign the last map we created to a new object named traffic_bias_map_continuous:

# Creates map showing variation in "bias_index" across CO counties, and assigns 
# the map to an object named "traffic_bias_map_continuous"
traffic_bias_map_continuous<-
tm_shape(county_shapefile_biasIndex)+
  tm_polygons(col="bias_index", 
              palette=my_colors, 
              textNA="No Data", 
              breaks=c(-10,-5, 0, 2, 4, 5))+
  tm_layout(frame=FALSE, 
            legend.outside=TRUE)

Now, we can bring up this map whenever we need it by printing the name of the object to which it is assigned:

# Prints contents of "traffic_bias_map_continuous"
traffic_bias_map_continuous

6.4 Make a categorical map

So far, we have been mapping the bias_index variable, which is a continuous variable. This has the advantage of allowing us to visualize the full extent of variation in the bias_index variable. However, there are also other ways we might visualize the data. For example, we could transform bias_index from a continuous numeric variable into a categorical variable, and visualize this categorical variable on a map.

More specifically, let’s say we want to use a map to clearly distinguish the counties where racial bias in traffic stops appears to be a problem (where “bias_index”>0) from those counties where it does NOT appear to be a problem (where “bias_index”<=0). On the one hand, this would throw out useful information regarding the full extent of variation in the “bias_index” indicator, but on the other hand, it could yield a starker and more focused map.

To build such a map, the first step is to create a new categorical variable, based on the continuous “bias_index” variable, within county_shapefile_biasIndex. In the code below, we take the existing spatial dataset assigned to the county_shapefile_biasIndex object, and then use the mutate() function to create a new variable named “apparent_bias.” This new “apparent_bias” variable will be coded as “Apparent Bias” for counties where “bias_index”>0, and coded as “No Apparent Bias” for all other counties (i.e. where the bias index is less than or equal to zero). This is accomplished using the ifelse() function. The first argument to the ifelse() function is a given condition (here bias_index>0). The second argument specifies the value that the new “apparent_bias” variable should take when that condition is true; the third argument specifies the value that the “apparent_bias” variable should take when that condition is false. After creating and defining this new variable, we assign the changes back to the county_shapefile_biasIndex object, which overwrites the previous version of the dataset.

# Takes the existing dataset assigned to the "county_shapefile_biasIndex" object, 
# and creates a new variable named "apparent_bias"; this variable takes on the 
# value "Apparent Bias" where the "bias_index" variable is >0 and "No Apparent Bias" 
# where it is less than or equal to zero; these changes are then assigned back to 
# the "county_shapefile_biasIndex" object
county_shapefile_biasIndex<-
  county_shapefile_biasIndex %>% 
            mutate(apparent_bias=ifelse(bias_index>0, 
                                        "Apparent Bias", 
                                        "No Apparent Bias"))

Let’s take a look at what the new variable looks like:

# prints contents of "county_shapefile_biasIndex"
county_shapefile_biasIndex
Simple feature collection with 64 features and 28 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -109.0602 ymin: 36.99245 xmax: -102.0415 ymax: 41.00344
geographic CRS: NAD83
First 10 features:
       NAME                       geometry bias_index    apparent_bias STATEFP
1  Saguache MULTIPOLYGON (((-106.8714 3...  0.5591577    Apparent Bias      08
2  Sedgwick MULTIPOLYGON (((-102.6521 4...  3.0473002    Apparent Bias      08
3  Cheyenne MULTIPOLYGON (((-102.5769 3...  3.1472044    Apparent Bias      08
4    Custer MULTIPOLYGON (((-105.7969 3... -0.2021878 No Apparent Bias      08
5  La Plata MULTIPOLYGON (((-108.2952 3...  0.2272495    Apparent Bias      08
6  San Juan MULTIPOLYGON (((-107.9751 3...  0.8000000    Apparent Bias      08
7    Pitkin MULTIPOLYGON (((-106.9154 3...  0.3336883    Apparent Bias      08
8      Park MULTIPOLYGON (((-105.9751 3...  0.3973331    Apparent Bias      08
9   Alamosa MULTIPOLYGON (((-106.0393 3... -0.2620964 No Apparent Bias      08
10  Prowers MULTIPOLYGON (((-102.2111 3...  3.2319999    Apparent Bias      08
   COUNTYFP COUNTYNS GEOID        NAMELSAD LSAD CLASSFP MTFCC CSAFP CBSAFP
1       109 00198170 08109 Saguache County   06      H1 G4020  <NA>   <NA>
2       115 00198173 08115 Sedgwick County   06      H1 G4020  <NA>   <NA>
3       017 00198124 08017 Cheyenne County   06      H1 G4020  <NA>   <NA>
4       027 00198129 08027   Custer County   06      H1 G4020  <NA>   <NA>
5       067 00198148 08067 La Plata County   06      H1 G4020  <NA>  20420
6       111 00198171 08111 San Juan County   06      H1 G4020  <NA>   <NA>
7       097 00198164 08097   Pitkin County   06      H1 G4020   233  24060
8       093 00198162 08093     Park County   06      H1 G4020   216  19740
9       003 00198117 08003  Alamosa County   06      H1 G4020  <NA>   <NA>
10      099 00198165 08099  Prowers County   06      H1 G4020  <NA>   <NA>
   METDIVFP FUNCSTAT      ALAND   AWATER    INTPTLAT     INTPTLON
1      <NA>        A 8206547705  4454510 +38.0316514 -106.2346662
2      <NA>        A 1419419016  3530746 +40.8715679 -102.3553579
3      <NA>        A 4605713960  8166129 +38.8356456 -102.6017914
4      <NA>        A 1913031921  3364150 +38.1019955 -105.3735123
5      <NA>        A 4376255148 25642578 +37.2873673 -107.8397178
6      <NA>        A 1003660672  2035929 +37.7810492 -107.6702567
7      <NA>        A 2514104907  6472577 +39.2175376 -106.9161587
8      <NA>        A 5682182508 43519840 +39.1189141 -105.7176479
9      <NA>        A 1871465874  1847610 +37.5684423 -105.7880414
10     <NA>        A 4243429484 15345176 +37.9581814 -102.3921613
       county_name   County black_stop_pct black_pop_pct black_stops
1  Saguache County Saguache      0.7296607     0.1705030          20
2  Sedgwick County Sedgwick      3.4120735     0.3647733          26
3  Cheyenne County Cheyenne      3.4358047     0.2886003          38
4    Custer County   Custer      0.8474576     1.0496454           1
5  La Plata County La Plata      0.6191950     0.3919455          70
6  San Juan County San Juan      0.8000000     0.0000000           1
7    Pitkin County   Pitkin      0.8213552     0.4876670           4
8      Park County     Park      0.7943403     0.3970072          64
9   Alamosa County  Alamosa      0.9602501     1.2223466          43
10  Prowers County  Prowers      3.7458295     0.5138297         247
   total_stops total_pop total_black_pop_over17 total_pop_over17
1         2741      6108                      8             4692
2          762      2379                      7             1919
3         1106      1836                      4             1386
4          118      4255                     37             3525
5        11305     51334                    160            40822
6          125       699                      0              571
7          487     17148                     69            14149
8         8057     16206                     52            13098
9         4478     15445                    142            11617
10        6594     12551                     47             9147
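Before mapping the new variable, we can also quickly tally how many counties fall into each category. A small sketch using base R's table() function, which counts the occurrences of each value:

```r
# Tallies counties by "apparent_bias" category; useNA="ifany" also counts
# any counties where the bias index (and hence the category) is missing
table(county_shapefile_biasIndex$apparent_bias, useNA = "ifany")
```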

Now that we have created this new categorical variable, let’s go ahead and create a map that displays it on our map of Colorado counties. The code below looks very similar to the code used to map the original “bias_index” variable. The main difference is that instead of setting col="bias_index", we set col="apparent_bias", since we now want to map the categorical “apparent_bias” variable, rather than the continuous “bias_index” variable. Another difference worth pointing out is that we use a different, two-color palette (since there are now only two categories to map); this palette is defined by the vector c("orangered1", "white"), which is passed to the “palette” argument within the tm_polygons() function. Finally, in the previous map, the legend’s title was taken from the name of the column that was mapped; here, that title would be redundant, so we remove it by setting title="". We’ll assign this map of the categorical “apparent_bias” variable to a new object named traffic_bias_map_categorical:

# Creates county-level map of categorical "apparent_bias" variable and 
# assigns it to an object named "traffic_bias_map_categorical"
traffic_bias_map_categorical<-
  tm_shape(county_shapefile_biasIndex)+
      tm_polygons(col="apparent_bias", 
                  title="", 
                  palette=c("orangered1", "white"), 
                  textNA="No Data")+
      tm_layout(frame=FALSE, 
                legend.outside=TRUE)

If we print the contents of traffic_bias_map_categorical, we will see a map that looks something like this:

# prints contents of "traffic_bias_map_categorical"
traffic_bias_map_categorical

7 Refining and formatting maps

In the previous section, we created and refined two maps based on the index of racial bias in traffic stops we created earlier in the tutorial. The first map showed the spatial distribution of the continuous “bias_index” variable with respect to counties in Colorado.

It looked like this:

# prints "traffic_bias_map_continuous"
traffic_bias_map_continuous

The second map displayed a new categorical variable, named “apparent_bias”, which we derived from “bias_index”.

It looked like this:

# prints "traffic_bias_map_categorical"
traffic_bias_map_categorical

In this section, we’ll continue to refine and customize these maps (for example, by adding titles, map credits, county labels, and labels and titles for the legend).

7.1 Refine the map of the continuous “bias_index” variable

Let’s start by fine-tuning the map of the continuous “bias_index” variable. As an exercise, read through the code below (without scrolling down to the map), and see if you can decipher what each line is doing, forming a mental picture of the in-progress map as you go. If you’re having trouble understanding something, look up the function’s documentation, which you can access by typing the name of the function, preceded by a question mark. For example, if you want to access the documentation for the tm_polygons() function, you can type ?tm_polygons, which will bring up the function’s documentation (which provides a description of the various arguments to the function), in the “Help” tab on the bottom-right of the R Studio interface.

my_colors<-c("white", "peachpuff", "red1", "red4", "navy")

traffic_bias_map_continuous<-
tm_shape(county_shapefile_biasIndex)+
  tm_polygons(col="bias_index", 
              palette=my_colors,
              title="(Black % of Traffic Stops) - (Black % of Population)",
              textNA="No Data", 
              breaks=c(-10,-5, 0.01, 2, 4, 5))+
  tm_layout(frame=FALSE, 
            legend.outside=TRUE,
            legend.text.size=0.68,
            legend.title.size=0.75,
            title.size=0.75,
            title.fontface = 2,
            title="Traffic Stop Racial Bias Index",
            main.title="Racial Bias in Traffic Stops, by County (Colorado)",
            main.title.position=0.03,
            main.title.size=1,
            attr.outside=TRUE)+
  tm_credits("Map Author: NAME\nData Sources: 2010 Decennial Census via tidycensus,\nStanford Open Policing Project,\nColorado GeoLibrary ", 
            position=c(0.02,0.01),
            size=0.38)

Having read through this code, let’s see what the resulting map looks like:

# prints updated map assigned to "traffic_bias_map_continuous" object
traffic_bias_map_continuous

Now, let’s unpack the various components of the code.

  • The code begins by defining the my_colors vector; this is the same color vector we used in Section 6.3.3, and is reproduced for convenience here.
  • traffic_bias_map_continuous<- indicates that the map created through the code on the right side of the assignment operator will be assigned to the object named traffic_bias_map_continuous. This overwrites the previous version of the map assigned to this object.
  • The first function that is called is tm_shape(); we pass the name of the object that contains the data we want to map to this function, which yields tm_shape(county_shapefile_biasIndex).
  • After specifying our dataset with tm_shape(), we call the tm_polygons() function. The tm_polygons() function takes several arguments, beginning with col="bias_index" (which specifies the column within county_shapefile_biasIndex that we want to map) and palette=my_colors (which specifies that we want to use the colors from the my_colors vector to represent the different interval ranges for the “bias_index” variable). Next, in title="(Black % of Traffic Stops) - (Black % of Population)", we define a subtitle for the legend, so that viewers of the map can discern the intuition behind the bias index that is being mapped. Finally, textNA="No Data" defines the legend label for “NA” values, and breaks=c(-10,-5, 0.01, 2, 4, 5) sets the data intervals, which are communicated in the map’s legend. Notice that instead of setting one of the break points at 0, this version of the map uses 0.01; that is because the lower bound of each data interval is inclusive, while the upper bound is exclusive (i.e. a hypothetical county with a bias index of exactly 4.00 would be assigned the color that corresponds to the 4.00 to 5.00 legend interval, not the one corresponding to 2.00 to 4.00). With a break point at exactly 0, a county with a bias index of exactly zero would therefore be grouped with counties that have positive values on the index, when it would make more sense to group such a county with counties that have negative values; setting the break point slightly above 0 avoids this scenario.
  • The next tmap function that is called is the tm_layout() function, which allows us to specify various parameters that shape the map’s layout. Recall from before that frame=FALSE removes the map’s bounding box (which surrounds the area displayed on the map), while legend.outside=TRUE places the legend outside the map’s domain, so that it doesn’t overlap with any of the counties that are displayed. Next, legend.text.size=0.68 sets the size of legend text elements (that are not part of the legend’s title or subtitle), while legend.title.size=0.75 sets the size of the legend’s subtitle, and title.size=0.75 sets the size of the text in the legend’s main title. Note that when setting the size of text elements on your map, the best approach is usually trial and error; start with an arbitrary size, and then iterate from there. title="Traffic Stop Racial Bias Index" sets the text of the legend’s main title, and title.fontface = 2 sets the legend’s main title text in boldface. main.title="Racial Bias in Traffic Stops, by County (Colorado)" sets the map’s main title, main.title.position=0.03 sets the position/justification of the main title, and main.title.size=1 sets the size of the main title’s text. Finally, attr.outside=TRUE specifies that any map elements other than the legend (such as the map credits) are to be placed outside the map boundary.
  • The tm_credits() function is a tmap function which allows us to add a credits section to the map, to convey information about things such as the name of the map author and the sources of the data used to create the map. In the code above, the first argument to the tm_credits() function is simply the text we would like to include in the credits; new lines are indicated by text that reads \n. Second, position=c(0.02,0.01) specifies the position of the credits section with respect to the map (the first element of the vector, 0.02, can be thought of as the x-coordinate, while the second element, 0.01, can be thought of as a y-coordinate); these coordinates place the credits below the map on the left-hand side. Finally, size=0.38 specifies the size of the text used in the credits section.
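The interval-closure behavior described above (lower bounds inclusive, upper bounds exclusive) can be checked outside of tmap using base R’s cut() function with right=FALSE, which bins values into the same kind of left-closed intervals. This is just an illustrative check, not part of the mapping pipeline:

```r
# Illustrative check of left-closed intervals [a, b), using the same breaks
breaks<-c(-10, -5, 0.01, 2, 4, 5)

# A county with a bias index of exactly 4.00 falls in [4, 5), not [2, 4)
as.character(cut(4, breaks, right=FALSE))
## [1] "[4,5)"

# With the break at 0.01 rather than 0, an index of exactly 0 is grouped
# with the negative-value interval, as intended
as.character(cut(0, breaks, right=FALSE))
## [1] "[-5,0.01)"
```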

Notice that our map currently does not have labels for county names; these can easily be added using the tm_text() function from the tmap package. Instead of rewriting all of the code from above, we can simply append the tm_text() function and its arguments to the map object we already created (traffic_bias_map_continuous). The code below takes the existing traffic_bias_map_continuous object, which we updated above, and adds a line of code that reads tm_text("NAME", size=0.30, fontface=2). This labels the map’s polygons using information from the underlying dataset’s “NAME” field (which contains county names), using a text size of 0.3; the argument fontface=2 specifies that the county name labels are to be printed in boldface. No other changes or additions are made to traffic_bias_map_continuous. The labeled version of the map is assigned to a new object, named traffic_bias_map_continuous_labeled.

# Adds labels to "traffic_bias_map_continuous" and assigns new labeled map
# to new object named "traffic_bias_map_continuous_labeled"
traffic_bias_map_continuous_labeled<-
  traffic_bias_map_continuous+
    tm_text("NAME", size=0.30, fontface=2)

When we open traffic_bias_map_continuous_labeled, we can see the map with labels:

# prints contents of "traffic_bias_map_continuous_labeled"
traffic_bias_map_continuous_labeled

7.2 Refine the categorical map

The code below refines the categorical map we created in Section 6.4. The map is largely the same as the one created in that section, but it adds a main title (and customizes its appearance) using arguments to the tm_layout() function that we have already discussed in Section 7.1, adds a map credits section that is identical to the one we used in Section 7.1, and labels each county using tm_text(). The map is assigned to the traffic_bias_map_categorical object, which overwrites the map we previously assigned to this object in Section 6.4.

# Creates a map of the categorical "apparent_bias" variable, with counties 
# where there is apparent bias in traffic police stops displayed in "orangered1", 
# and counties without apparent bias displayed in "white"
traffic_bias_map_categorical<-
tm_shape(county_shapefile_biasIndex)+
  tm_polygons(col="apparent_bias", 
              title="", 
              palette=c("orangered1", "white"), 
              textNA="No Data")+
    tm_layout(frame=FALSE, 
              legend.outside=TRUE,
              main.title="Racial Bias in Traffic Stops, by County in Colorado\n(Based on Whether % of Black Motorists Stopped by Police > County's Adult Black Population %)",
              main.title.position=0.03,
              main.title.size=0.75,
              attr.outside=TRUE)+
  tm_credits("Map Author: NAME\nData Sources: 2010 Decennial Census via tidycensus,\nStanford Open Policing Project,\nColorado GeoLibrary ",
              position=c(0.02,0.01), # Specifies location of map credits
             size=0.38)+
  tm_text("NAME", size=0.30, fontface=2)

We can print the contents of traffic_bias_map_categorical to view the updated map in the “Plots” window:

# prints contents of "traffic_bias_map_categorical"
traffic_bias_map_categorical

7.3 Create a dynamic map of the bias index

Recall from Section 6.1 that it is possible to render both static maps and interactive maps using the tmap package. Thus far, our focus has been on creating static maps to display county-level variation in the “bias_index” variable (or the derivative categorical variable, “apparent_bias”). It’s worth briefly reminding ourselves that we can easily turn these static maps into dynamic maps by changing the tmap mode to “view.” For example, let’s say we want to display an interactive version of the map assigned to traffic_bias_map_continuous:

# Changes tmap mode to "view"
tmap_mode("view")
## tmap mode set to interactive viewing
# Prints "traffic_bias_map_continuous" in View mode
traffic_bias_map_continuous
## Credits not supported in view mode.

Before proceeding with the lesson, please toggle back to “plot” mode.
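Toggling back requires a single call to tmap_mode() (assuming tmap is still loaded):

```r
# Changes tmap mode back to "plot" for static maps
tmap_mode("plot")
```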

8 Exporting maps

Once you have made a map and are satisfied with its appearance, it is straightforward to export the map from R Studio to a local directory on your computer, where you can then embed your map in papers, projects, presentations, or websites.

The easiest way to export maps is by using the tmap_save() function. This function allows users to specify the desired dimensions and resolution of the exported map, among other things. For our purposes, we’ll simply export the map assigned to traffic_bias_map_continuous_labeled using the default settings. Below, the first argument is the name of the map object we’d like to export; here, that is traffic_bias_map_continuous_labeled. The second argument is the desired file name and extension; a number of extensions are possible, such as pdf, jpeg, and png. Here, we’ll export the map as a .png file.

# Exports map assigned to "traffic_bias_map_continuous_labeled" object to working directory; 
# the exported map is saved as "traffic_bias_map_continuous_labeled.png"
tmap_save(traffic_bias_map_continuous_labeled, 
          "traffic_bias_map_continuous_labeled.png")
## Map saved to /Users/adra7980/Documents/git_repositories/jmgl_sopp_workshop/traffic_bias_map_continuous_labeled.png
## Resolution: 2448.943 by 1800.777 pixels
## Size: 8.163142 by 6.002591 inches (300 dpi)

Because we didn’t specify the path to a directory in the second argument, the map will be exported to our working directory.

Note that if you’d like to export a dynamic/interactive map, you can use the same code as above; the only difference is that you would use an “html” extension. The interactive map would then be exported to your working directory as an html file, which can then easily be embedded on a website.
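As an illustrative sketch (the file name below is arbitrary), exporting the continuous map as an interactive html file would look like this:

```r
# Exports the map assigned to "traffic_bias_map_continuous" as a standalone
# interactive HTML file in the working directory
tmap_save(traffic_bias_map_continuous, 
          "traffic_bias_map_continuous.html")
```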

9 Summary scripts

This section summarizes the code we have written over the course of the tutorial to map county-level variation in possible anti-Black bias in Colorado’s 2010 traffic patrol stops. Section 9.1 provides the script to clean and process the original dataset published by the Stanford Open Policing Project, and get that data into a form that is suitable for mapping. Section 9.2 provides the script to create a map of the continuous “bias_index” variable, with counties labeled. Section 9.3 provides the script to derive a categorical variable that indicates whether a county’s value for “bias_index” is greater than zero, or less than or equal to zero, and then map this categorical variable (with counties labeled).

9.1 Summary script to prepare, clean, and process data for mapping

# Read in Stanford police data for Colorado and assign to object 
# named "co_traffic_stops" 
co_traffic_stops<-read_csv("co_statewide_2020_04_01.csv")

# Create "Year" field based on existing "date" field
co_traffic_stops<-co_traffic_stops %>% 
                    mutate(Year=substr(date, 1, 4))

# Filter 2010 observations and assign to a new object named 
# "co_traffic_stops_2010"
co_traffic_stops_2010<-co_traffic_stops %>% filter(Year==2010)

# Compute county-level count of traffic stops by race and assign to object 
# named "co_county_summary"
co_county_summary<-co_traffic_stops_2010 %>% 
                    group_by(county_name) %>% 
                    count(subject_race) 

# Reshape the data so that the racial categories are transposed
# from rows into columns and assign the result to an object named
# "co_county_summary_wide"
co_county_summary_wide<-co_county_summary %>% 
                        pivot_wider(names_from=subject_race, values_from=n)


# Creates a new column named "total_stops" in "co_county_summary_wide" that
# contains information on the total number of stops for each county 
# (across all racial categories)
co_county_summary_wide<-co_county_summary_wide %>% 
                        rowwise() %>% 
                        mutate(total_stops=
                                 sum(c_across(where(is.integer)), na.rm=TRUE))

# Selects "county_name", "black", and "total_stops" variables from
# "co_county_summary_wide"; then renames the "black" variable to 
# "black_stops" for clarity; then removes counties that are named "NA" 
# due to an error in the dataset
co_county_black_stops<-co_county_summary_wide %>%
                        select(county_name, black, total_stops) %>% 
                        rename(black_stops=black) %>% 
                        filter(county_name!="NA")

# Read in the pre-prepared demographic data from the 2010 decennial 
# census and assign to an object named "co_counties_census_2010"
co_counties_census_2010<-read_csv("co_county_decennial_census.csv")

# Join "co_counties_census_2010" to "co_county_black_stops" and assign the result
# to an object named "co_counties_census_trafficstops"
co_counties_census_trafficstops<-full_join(co_county_black_stops, 
                                           co_counties_census_2010,
                                           by=c("county_name"="County"))

# Use the information in "co_counties_census_trafficstops" to define new 
# variables that will be used to compute the racial bias index: 
# "black_stop_pct" (the black percentage of overall traffic stops within
# a county) and "black_pop_pct" (the black percentage of the county's 
# over-17 population)
co_counties_census_trafficstops<-
  co_counties_census_trafficstops %>% 
    mutate(black_stop_pct=((black_stops/total_stops)*100),
           black_pop_pct=((total_black_pop_over17/total_pop_over17)*100))

# Calculate the bias index and include it as a new variable (named 
# "bias_index", which the mapping scripts below expect) in 
# "co_counties_census_trafficstops"
co_counties_census_trafficstops<-co_counties_census_trafficstops %>% 
                                  mutate(bias_index=
                                           black_stop_pct-black_pop_pct)

# Reads in Colorado county shapefile and assigns the shapefile to a new object 
# named "co_counties_shapefile"
co_counties_shapefile<-st_read("tl_2019_08_county.shp")

# Join "co_counties_census_trafficstops" to "co_counties_shapefile" using 
# "GEOID" as the join field; assign the result to a new object named 
# "county_shapefile_biasIndex"
county_shapefile_biasIndex<-full_join(co_counties_shapefile, 
                                      co_counties_census_trafficstops, 
                                      by="GEOID")

9.2 Summary script for map of continuous “bias_index” variable

# Creates color vector that defines the map's color palette
my_colors<-c("white", "peachpuff", "red1", "red4", "navy")

# make a map of the continuous "bias_index" variable
traffic_bias_map_continuous_labeled<-  
  tm_shape(county_shapefile_biasIndex)+ 
  tm_polygons(col="bias_index", 
              palette=my_colors, 
              title="(Black % of Traffic Stops) - (Black % of Population)", 
              textNA="No Data", # specifies label for NA data values
              n=5, 
              breaks=c(-10,-5, 0.01, 2, 4, 5))+ 
  tm_layout(frame=FALSE, 
            legend.outside=TRUE, 
            legend.text.size=0.68, 
            legend.title.size=0.75,  
            title="Traffic Stop Racial Bias Index", 
            title.size=0.75, 
            title.fontface = 2,
            main.title="Racial Bias in Traffic Stops, by County (Colorado)",
            main.title.position=0.03, 
            main.title.size=1, 
            attr.outside=TRUE)+ 
  tm_credits("Map Author: NAME\nData Sources: 2010 Decennial Census via tidycensus,\nStanford Open Policing Project,\nColorado GeoLibrary ", 
            position=c(0.02,0.01), 
            size=0.38)+ 
  tm_text("NAME", 
          size=0.30, 
          fontface=2)
# Prints map
traffic_bias_map_continuous_labeled

9.3 Summary script for categorical map

# Makes categorical variable based on the continuous "bias_index" variable;
# this categorical variable is named "apparent_bias", and is coded as 
# "Apparent Bias" if "bias_index">0 and "No Apparent Bias" if "bias_index"<=0
county_shapefile_biasIndex<-
  county_shapefile_biasIndex %>% 
            mutate(apparent_bias=ifelse(bias_index>0, "Apparent Bias", 
                                        "No Apparent Bias"))

# Makes categorical map based on "apparent_bias" variable and assigns this map to object named 
# "traffic_bias_map_categorical"
traffic_bias_map_categorical<-
  tm_shape(county_shapefile_biasIndex)+
    tm_polygons(col="apparent_bias", 
                title="", 
                palette=c("orangered1", "white"), 
                textNA="No Data")+ 
    tm_layout(frame=FALSE, 
              legend.outside=TRUE, 
              main.title="Racial Bias in Traffic Stops, by County in Colorado\n(Based on Whether % of Black Motorists Stopped by Police > County's Adult Black Population %)", 
              main.title.position=0.03, 
              main.title.size=0.75, 
              attr.outside=TRUE)+ 
  tm_credits("Map Author: NAME\nData Sources: 2010 Decennial Census via tidycensus,\nStanford Open Policing Project,\nColorado GeoLibrary ",  
              position=c(0.02,0.01), 
             size=0.38)+ 
  tm_text("NAME", 
          size=0.30, 
          fontface=2) 
# prints categorical map
traffic_bias_map_categorical

10 Reflections and Discussion Questions

Now that we’ve created our maps, it’s worthwhile to reflect on the process and its outcomes. Consider the following:

  • How might you change the maps we made? For example, would you change the color schemes, number of intervals, title, or any other features of their appearance? If so, why? See if you can change the code we developed to implement those changes.
  • Reflect on the process of “cleaning” and reorganizing the original Stanford Policing Project dataset (this process is summarized in Section 9.1). Why were these steps necessary in order to generate our maps? What information was lost in the course of making these data transformations? What research questions might the original dataset be able to answer, that the “cleaned” dataset that results from these transformations could not answer?
  • What do you think are the possible shortcomings of the “bias_index” variable we created? Can you think of an alternative measure of racial bias in traffic stops that addresses these shortcomings? Could you create it with the data we already have? If not, what data would you need to collect?
  • What spatial patterns (if any) do you notice in how the “bias_index” is distributed across counties? Are there regions of the state where several counties with high values for the bias index are clustered? What might explain these patterns?
  • What implications might these maps have for public policy?

11 Conclusion and Further Reading

If you’re interested in exploring projects and resources that lie at the intersection of GIS, Black history, and the study of structural racism, ESRI’s Racial Equity Hub is a good place to start.

If you’re interested in exploring a specific example of a large-scale interdisciplinary project on systemic racism that incorporates spatial analysis, check out the Mapping Prejudice project from the University of Minnesota (Ehrman-Solberg et al, 2020). The data for that project is available at the University of Minnesota’s institutional repository.

If you’d like to explore publications that use the Open Policing Data, including the initial paper by Pierson et al (2020), see the Stanford Open Policing Project’s publications page. The Stanford project site also has a useful tutorial that introduces the data and provides some guidance in analyzing it. The tutorial includes a discussion of more advanced ways to measure racial bias in traffic stops than the one we used here, and a useful way to extend your knowledge might be to think about how to create and map those more advanced measures.

If you would like to further develop your spatial visualization and mapping skills, an excellent place to start is with the free and open-source book by Lovelace, Nowosad, and Muenchow (2021), entitled Geocomputation with R. The book is more than an introduction to making maps; it’s a comprehensive guide to spatial analysis and Geographic Information Systems using the R programming language.

12 References

Ehrman-Solberg, K., P. Petersen, M. Mills, K. Delegard, and R. Mattke. 2020. Racial Covenants in Hennepin County. University of Minnesota Data Repository. https://doi.org/10.13020/a88t-yb14

Esri. n.d. GIS for Racial Equity: Racial Equity Hub. Accessed February 15, 2022. https://gis-for-racialequity.hub.arcgis.com/

Esri. n.d. Data Classification Methods. Accessed February 14, 2021. https://pro.arcgis.com/en/pro-app/2.7/help/mapping/layer-properties/data-classification-methods.htm

Frazier, M. “R Color Cheatsheet.” Accessed February 10, 2022. https://www.nceas.ucsb.edu/sites/default/files/2020-04/colorPaletteCheatsheet.pdf.

Lovelace, R., J. Nowosad, and J. Muenchow. 2021. Geocomputation with R. https://geocompr.robinlovelace.net/.

Pebesma, E. 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10(1): 439-446. https://doi.org/10.32614/RJ-2018-009.

Pierson, E., C. Simoiu, J. Overgoor, S. Corbett-Davies, D. Jenson, A. Shoemaker, V. Ramachandran, P. Barghouty, C. Phillips, R. Shroff, and S. Goel. 2020. A large-scale analysis of racial disparities in police stops across the United States. Nature Human Behaviour 4: 736-745.

R For Social Scientists: Setup. 2018-2021. Data Carpentry. https://datacarpentry.org/r-socialsci/setup.html.

Stanford Open Policing Project. 2019. Open Policing Project Tutorial. Accessed February 15, 2022. https://openpolicing.stanford.edu/tutorials/.

Stelter, M., I. Essien, C. Sander, and J. Degner. 2021. “Racial bias in police traffic stops: White residents’ county-level prejudice and stereotypes are related to disproportionate stopping of Black drivers.” PsyArXiv Preprints. https://psyarxiv.com/djp8g/.

Tennekes, M. 2018. tmap: Thematic Maps in R. Journal of Statistical Software 84(6): 1-39. https://doi.org/10.18637/jss.v084.i06.

U.S. Census Bureau. 2019. Colorado Counties Shapefile. https://geo.colorado.edu/catalog/47540-5e712aeda3d91e0009f59fc7.

Walker, K. and M. Herman. 2021. “tidycensus: Load US Census Boundary and Attribute Data as tidyverse and sf-Ready Data Frames.” R package version 1.1. https://CRAN.R-project.org/package=tidycensus

Wickham, H., M. Averick, J. Bryan, W. Chang, L.D. McGowan, R. Francois, G. Grolemund et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4(43): 1686. https://doi.org/10.21105/joss.01686.

13 Appendix: Using tidycensus to extract relevant census data

In Section 5.1, we read in a county-level demographic dataset (extracted from the 2010 decennial census), which we subsequently used to generate one of the components of our bias indicator. That dataset was provided to you at the start of the workshop, and we did not discuss the process by which the dataset was generated in the main lesson. However, if you are curious about how that data was generated, the script in this Appendix can be used as a guide to reproduce the demographic dataset introduced in Section 5.1.

To extract the census demographic data required to create the bias index, we must first load a few required packages. In particular, please load the tidyverse (which was used in the workshop tutorial) and the tidycensus package, an R package that allows users to pull Census Bureau data through the Census API (if tidycensus is not already installed, please install it using install.packages("tidycensus")).

# Loads libraries required to extract census data
library(tidycensus)
library(tidyverse)

Before extracting data with tidycensus, you must acquire a Census API key from the Census Bureau website; once you apply for a key on the website, your key will be emailed to the address you provide. Enter your census API key in R Studio with the following code, replacing the placeholder text with your key:

census_api_key("INSERT HERE")

In the workshop, we were working with 2010 police stops data, so it made sense to pull demographic data from the 2010 decennial census. The discussion below is therefore framed with respect to the 2010 decennial census; if you choose to use another census data product (such as the American Community Survey) to create the bias index, your code may look slightly different.

13.1 Define your variables

First, we can generate a table that contains the various variables (and associated variable codes) for the 2010 decennial census by using the load_variables() function. The arguments to the function below (2010, "sf1") indicate that we want to extract variable names and codes for the 2010 decennial census. Below, we assign the table to an object named decennial_2010_variables:

# Variable list for 2010 Decennial assigned to object named "decennial_2010_variables"
decennial_2010_variables<-load_variables(2010, "sf1")

At this point, we can easily view the 2010 decennial variables table by passing decennial_2010_variables as an argument to the View() function:

# Opens the variable table in the R Studio data viewer
View(decennial_2010_variables)

Based on the information in decennial_2010_variables, we can identify the variable codes for the variables we want to extract. Then, we will define a vector (see below), assigned to an object named my_vars, that maps descriptive names to these variable codes; the descriptive names will be used as column names in the dataset returned by the census API call, while the variable codes will be used by tidycensus to populate the respective fields with the desired data.

For the purpose of defining the “bias_index” variable, recall that we needed to define the county-level percentage of traffic stops involving a Black motorist, and the county-level percentage of the adult Black population (with respect to the county’s overall adult population). The first component of this index was derived using the Stanford Open Policing data. In order to calculate the second component of the bias index, we need two key variables at the county level: the total adult (which we define as over-17, i.e. 18 and older) population, and the total adult Black population. These two variables can be extracted from the decennial census. However, there is no separate category in the decennial census for these two measures, so we must derive them based on the data that is available.

Therefore, given the available data, calculating the total over-17 population will require us to extract data for the male under-5 population, the male 5 to 9 population, the male 10 to 14 population, and the male 15 to 17 population, and analogous measures for the female population. Subtracting these values from the total overall population (among all age groups) will yield a value for the total over-17 population.

To calculate the Black over-17 population, we will extract a variable that defines the total Black population, and a series of variables that indicate the Black population for different demographic (sex/age) combinations 17 years old and younger; subtracting the sum of the latter variables from the total Black population yields a value for the Black over-17 population.
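As a quick numerical illustration of this subtraction logic (the counts below are made up for illustration, not actual census values):

```r
# Hypothetical counts for a single county (illustrative only)
total_pop<-10000   # total population, across all age groups
under18_counts<-c(men_u5=300, men_5to9=310, men_10to14=320, men_15to17=200,
                  women_u5=290, women_5to9=305, women_10to14=315, women_15to17=195)

# Subtracting the under-18 groups from the overall total yields the 
# over-17 population
total_pop_over17<-total_pop-sum(under18_counts)
total_pop_over17
## [1] 7765
```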

The code below extracts all the variables needed to carry out these calculations:

# Define and name variables for census API call

my_vars<-c(total_pop="P001001",
           totalpop_men_u5="P012003",
           totalpop_men_5to9="P012004",
           totalpop_men_10to14="P012005",
           totalpop_men_15to17="P012006",
           totalpop_women_u5="P012027",
           totalpop_women_5to9="P012028",
           totalpop_women_10to14="P012029",
           totalpop_women_15to17="P012030",
           black_totalpop="PCT012B001",
           black_men_u1="PCT012B003",
           black_men_1="PCT012B004",
           black_men_2="PCT012B005",
           black_men_3="PCT012B006",
           black_men_4="PCT012B007",
           black_men_5="PCT012B008",
           black_men_6="PCT012B009",
           black_men_7="PCT012B010",
           black_men_8="PCT012B011",
           black_men_9="PCT012B012",
           black_men_10="PCT012B013",
           black_men_11="PCT012B014",
           black_men_12="PCT012B015",
           black_men_13="PCT012B016",
           black_men_14="PCT012B017",
           black_men_15="PCT012B018",
           black_men_16="PCT012B019",
           black_men_17="PCT012B020",
           black_women_u1="PCT012B107",
           black_women_1="PCT012B108",
           black_women_2="PCT012B109",
           black_women_3="PCT012B110",
           black_women_4="PCT012B111",
           black_women_5="PCT012B112",
           black_women_6="PCT012B113",
           black_women_7="PCT012B114",
           black_women_8="PCT012B115",
           black_women_9="PCT012B116",
           black_women_10="PCT012B117",
           black_women_11="PCT012B118",
           black_women_12="PCT012B119",
           black_women_13="PCT012B120",
           black_women_14="PCT012B121",
           black_women_15="PCT012B122",
           black_women_16="PCT012B123",
           black_women_17="PCT012B124")

13.2 Extract the variables using tidycensus

Now that we have a vector of the variables we want to extract (along with descriptive names for those variables), we will use the get_decennial() function from tidycensus to extract these variables from the 2010 decennial census. Several arguments are passed to the get_decennial() function below:

  • geography="county" specifies that we want the census data to be provided at the county level
  • variables=my_vars specifies the variables we want to extract, and the names they are to be given in the dataset; this information is contained in the my_vars vector defined above
  • state="CO" specifies the state for which we want to extract the data; this argument, together with the geography="county" argument, means that tidycensus will extract the data specified in my_vars at the county level for the state of Colorado.
  • survey="sf1" indicates which census dataset we would like to query; here sf1 (short for Summary File 1), indicates we are referring to the decennial census (as opposed, for example, to the American Community Survey)
  • output="wide" indicates that we want the dataset with the extracted variables to be in “wide” format, wherein each variable is assigned to its own column.
  • year=2010 indicates that we are interested in data from 2010. Combined with survey="sf1", this will extract the 2010 decennial census data.

Finally, we’ll assign the extracted dataset to an object named co_counties_race:

# Issue call to Census API
co_counties_race<-
get_decennial(
  geography="county", 
  variables=my_vars,
  state="CO",
  survey="sf1",
  output="wide",
  year=2010)          
## Getting data from the 2010 decennial Census
## Using Census Summary File 1

We can print the first few lines of the dataset to the console to view its structure, and ensure that everything is in order:

# prints contents of "co_counties_race"
co_counties_race
# A tibble: 64 × 48
   GEOID NAME        total_pop totalpop_men_u5 totalpop_men_5t… totalpop_men_10…
   <chr> <chr>           <dbl>           <dbl>            <dbl>            <dbl>
 1 08023 Costilla C…      3524              99              102              107
 2 08025 Crowley Co…      5823              88              110              136
 3 08027 Custer Cou…      4255              62               95              113
 4 08029 Delta Coun…     30952             887              913             1042
 5 08031 Denver Cou…    600158           22252            18894            15319
 6 08035 Douglas Co…    285465           11278            13587            12664
 7 08033 Dolores Co…      2064              58               63               71
 8 08049 Grand Coun…     14843             416              421              431
 9 08039 Elbert Cou…     23086             594              790              976
10 08041 El Paso Co…    622263           23152            23050            23252
# … with 54 more rows, and 42 more variables: totalpop_men_15to17 <dbl>,
#   totalpop_women_u5 <dbl>, totalpop_women_5to9 <dbl>,
#   totalpop_women_10to14 <dbl>, totalpop_women_15to17 <dbl>,
#   black_totalpop <dbl>, black_men_u1 <dbl>, black_men_1 <dbl>,
#   black_men_2 <dbl>, black_men_3 <dbl>, black_men_4 <dbl>, black_men_5 <dbl>,
#   black_men_6 <dbl>, black_men_7 <dbl>, black_men_8 <dbl>, black_men_9 <dbl>,
#   black_men_10 <dbl>, black_men_11 <dbl>, black_men_12 <dbl>, …

As always, it is also possible to view the dataset in the R Studio data viewer by running View(co_counties_race).

13.3 Clean the tidycensus dataset

Having extracted the dataset, you may want to tidy it up, depending on your needs and preferences. For example, the “NAME” field includes the state name, which is redundant here since the dataset contains only Colorado observations. The code below creates a new variable named “County” by removing the state name from the “NAME” field, and updates the co_counties_race object with this new variable:

# Remove state name from NAME field
co_counties_race<-co_counties_race %>% 
                  mutate(County=str_remove_all(NAME, ", Colorado"))   

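To see what str_remove_all() does, you can try it on a single county name; it deletes every occurrence of the supplied pattern from the string:

```r
library(stringr)

# Removes every occurrence of ", Colorado" from the string
str_remove_all("Costilla County, Colorado", ", Colorado")
# "Costilla County"
```
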
Note the newly created “County” variable is now in the dataset:

# Prints contents of "co_counties_race"
co_counties_race
# A tibble: 64 × 49
   GEOID County          NAME         total_pop totalpop_men_u5 totalpop_men_5t…
   <chr> <chr>           <chr>            <dbl>           <dbl>            <dbl>
 1 08023 Costilla County Costilla Co…      3524              99              102
 2 08025 Crowley County  Crowley Cou…      5823              88              110
 3 08027 Custer County   Custer Coun…      4255              62               95
 4 08029 Delta County    Delta Count…     30952             887              913
 5 08031 Denver County   Denver Coun…    600158           22252            18894
 6 08035 Douglas County  Douglas Cou…    285465           11278            13587
 7 08033 Dolores County  Dolores Cou…      2064              58               63
 8 08049 Grand County    Grand Count…     14843             416              421
 9 08039 Elbert County   Elbert Coun…     23086             594              790
10 08041 El Paso County  El Paso Cou…    622263           23152            23050
# … with 54 more rows, and 43 more variables: totalpop_men_10to14 <dbl>,
#   totalpop_men_15to17 <dbl>, totalpop_women_u5 <dbl>,
#   totalpop_women_5to9 <dbl>, totalpop_women_10to14 <dbl>,
#   totalpop_women_15to17 <dbl>, black_totalpop <dbl>, black_men_u1 <dbl>,
#   black_men_1 <dbl>, black_men_2 <dbl>, black_men_3 <dbl>, black_men_4 <dbl>,
#   black_men_5 <dbl>, black_men_6 <dbl>, black_men_7 <dbl>, black_men_8 <dbl>,
#   black_men_9 <dbl>, black_men_10 <dbl>, black_men_11 <dbl>, …

13.4 Define new variables

Now that we have a cleaned dataset with all our necessary variables, we can use them to generate the demographic measures needed to calculate the bias index. First, the code below defines a new variable, “total_pop_over17”, calculated by subtracting the population aged 17 and under from the total overall population:

# Create variable for total over-17 population
co_counties_race<-
  co_counties_race %>% 
     mutate(total_pop_over17=
              total_pop-totalpop_men_u5-totalpop_men_5to9-
              totalpop_men_10to14-totalpop_men_15to17-totalpop_women_u5-
              totalpop_women_5to9-totalpop_women_10to14-totalpop_women_15to17)

Then, we create a new variable named “total_black_pop_over17”, which is defined by subtracting the total Black population that is 17 and under from the total Black population:

# Create variable for total over-17 Black population
co_counties_race<-
  co_counties_race %>% 
  mutate(total_black_pop_over17=
          black_totalpop-black_men_u1-black_men_1-
          black_men_2-black_men_3-black_men_4-black_men_5-black_men_6-
          black_men_7-black_men_8-black_men_9-black_men_10-black_men_11-
          black_men_12-black_men_13-black_men_14-black_men_15-black_men_16-
          black_men_17-black_women_u1-black_women_1-black_women_2-black_women_3-
          black_women_4-black_women_5-black_women_6-black_women_7-black_women_8-
          black_women_9-black_women_10-black_women_11-black_women_12-
          black_women_13-black_women_14-black_women_15-black_women_16-
          black_women_17)
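
As an aside, a long chain of subtractions like this can be condensed with rowSums() and across(). The sketch below uses a small made-up tibble whose column names follow the same pattern as co_counties_race (the values are illustrative, not real census data), and subtracts all of the under-18 age columns in one step:

```r
library(dplyr)

# Illustrative tibble mimicking the naming pattern of "co_counties_race"
# (only a handful of the age columns; values are made up)
toy <- tibble(
  black_totalpop = c(100, 50),
  black_men_u1   = c(2, 1),
  black_men_1    = c(3, 0),
  black_women_u1 = c(1, 2),
  black_women_1  = c(4, 1)
)

# Subtract the sum of all under-18 age columns in a single expression
toy <- toy %>%
  mutate(total_black_pop_over17 =
           black_totalpop -
           rowSums(across(c(starts_with("black_men_"),
                            starts_with("black_women_")))))

toy$total_black_pop_over17
# 90 46
```

Either approach gives the same result; the explicit subtraction chain used above has the advantage of making every column involved visible at a glance.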

13.5 Finalize and export the dataset

Now that our two key variables are defined, let’s clean up the dataset by removing the variables we no longer need, keeping only those required to create the bias index. Below, we take the existing dataset assigned to co_counties_race and select the “GEOID” and “County” variables (which serve as ID variables), “total_black_pop_over17” and “total_pop_over17” (which are used to compute the bias index), and “total_pop” (which is not needed to create the bias index, but could prove useful for exploring alternative ways of defining a bias index beyond the one implemented in this tutorial). The new dataset is assigned to an object named co_counties_census_2010:

# Clean data by selecting relevant variables for analysis, and assign selection to a new object
# named "co_counties_census_2010"
co_counties_census_2010<-
  co_counties_race %>% 
    select(GEOID, County, total_pop, total_black_pop_over17, total_pop_over17)

Let’s now view the contents of co_counties_census_2010:

# prints contents of "co_counties_census_2010"
co_counties_census_2010
# A tibble: 64 × 5
   GEOID County          total_pop total_black_pop_over17 total_pop_over17
   <chr> <chr>               <dbl>                  <dbl>            <dbl>
 1 08023 Costilla County      3524                     18             2788
 2 08025 Crowley County       5823                    556             5034
 3 08027 Custer County        4255                     37             3525
 4 08029 Delta County        30952                    139            24101
 5 08031 Denver County      600158                  45338           471392
 6 08035 Douglas County     285465                   2447           198453
 7 08033 Dolores County       2064                      4             1602
 8 08049 Grand County        14843                     43            11825
 9 08039 Elbert County       23086                    122            17232
10 08041 El Paso County     622263                  27280           459587
# … with 54 more rows

At this point, we have recreated the census dataset that was provided to you at the start of the tutorial.

To export this dataset, use the write_csv() function. Below, the first argument is the name of the object containing the dataset to be exported, and the second argument is the desired file name. The data is exported to the current working directory as a CSV file, which can subsequently be opened in your spreadsheet software of choice.

# Exports the data assigned to the "co_counties_census_2010" object to the working directory
write_csv(co_counties_census_2010, "co_counties_census_2010.csv")
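
One caveat if you later re-import the file: read_csv() guesses column types, and by default it parses zero-padded codes like “GEOID” as numbers, dropping the leading zero. The sketch below demonstrates the issue with a made-up two-row tibble written to a temporary file (the values and path are illustrative), and shows how to preserve the codes by declaring the column type:

```r
library(readr)
library(tibble)

# Toy data with FIPS-style codes (illustrative values, not the real dataset)
toy <- tibble(GEOID = c("08023", "08025"), total_pop = c(3524, 5823))
path <- file.path(tempdir(), "toy_counties.csv")
write_csv(toy, path)

# Default type guessing parses GEOID as a number, dropping the leading zero
naive <- read_csv(path, show_col_types = FALSE)
naive$GEOID
# 8023 8025

# Declaring GEOID as character preserves the zero-padded codes
safe <- read_csv(path, col_types = cols(GEOID = col_character()))
safe$GEOID
# "08023" "08025"
```

So when reading co_counties_census_2010.csv back into R, it is worth passing col_types = cols(GEOID = col_character()) so the county FIPS codes survive the round trip.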